```
library(dplyr)
library(parsnip)
library(workflows)
library(recipes)
library(epiprocess)
library(epipredict)
```

## Goals for the package

At a high level, our goal with epipredict is to make running simple Machine Learning / Statistical forecasters for epidemiology easy. However, this package is extremely extensible, and that is part of its utility. Our hope is that it is easy for users with epi training and some statistics to fit baseline models while still allowing those with more nuanced statistical understanding to create complicated specializations using the same framework.

Serving both populations is the main motivation for our efforts, but at the same time, we have tried hard to make it useful.

### Baseline models

We provide a set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).

For the basic forecasters, we provide:

- Baseline flat-line forecaster
- Autoregressive forecaster
- Autoregressive classifier

All the forcasters we provide are built on our framework. So we will use these basic models to illustrate its flexibility.

### Forecasting framework

Our framework for creating custom forecasters views the prediction task as a set of modular components. There are four types of components:

- Preprocessor: make transformations to the data before model training
- Trainer: train a model on data, resulting in a fitted model object
- Predictor: make predictions, using a fitted model object and processed test data
- Postprocessor: manipulate or transform the predictions before returning

Users familiar with `{tidymodels}`

and
especially the `{workflows}`

package will notice a lot of overlap. This is by design, and is in fact
a feature. The truth is that epipredict is a wrapper
around much that is contained in these packages. Therefore, if you want
something from this -verse, it should “just work” (we hope).

The reason for the overlap is that workflows
*already implements* the first three steps. And it does this very
well. However, it is missing the postprocessing stage and currently has
no plans for such an implementation. And this feature is important. The
baseline forecaster we provide *requires* postprocessing.
Anything more complicated needs this as well.

The second omission from tidymodels is support for panel data. Besides epidemiological data, economics, psychology, sociology, and many other areas frequently deal with data of this type. So the framework of behind epipredict implements this. In principle, this has nothing to do with epidemiology, and one could simply use this package as a solution for the missing functionality in tidymodels. Again, this should “just work”.

All of the *panel data* functionality is implemented through
the `epi_df`

data type in the companion `{epiprocess}`

package. There is much more to see there, but for the moment, it’s
enough to look at a simple one:

```
jhu <- case_death_rate_subset
jhu
#> An `epi_df` object, 20,496 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2022-05-31 19:08:25.791826
#>
#> # A tibble: 20,496 × 4
#> geo_value time_value case_rate death_rate
#> * <chr> <date> <dbl> <dbl>
#> 1 ak 2020-12-31 35.9 0.158
#> 2 al 2020-12-31 65.1 0.438
#> 3 ar 2020-12-31 66.0 1.27
#> 4 as 2020-12-31 0 0
#> 5 az 2020-12-31 76.8 1.10
#> 6 ca 2020-12-31 96.0 0.751
#> 7 co 2020-12-31 35.8 0.649
#> 8 ct 2020-12-31 52.1 0.819
#> 9 dc 2020-12-31 31.0 0.601
#> 10 de 2020-12-31 65.2 0.807
#> # ℹ 20,486 more rows
```

This data is built into the package and contains the measured
variables `case_rate`

and `death_rate`

for
COVID-19 at the daily level for each US state for the year 2021. The
“panel” part is because we have repeated measurements across a number of
locations.

The `epi_df`

encodes the time stamp as
`time_value`

and the `key`

as
`geo_value`

. While these 2 names are required, the values
don’t need to actually represent such objects. Additional
`key`

’s are also supported (like age group, ethnicity,
taxonomy, etc.).

The `epi_df`

also contains some metadata that describes
the keys as well as the vintage of the data. It’s possible that data
collected at different times for the *same set* of
`geo_value`

’s and `time_value`

’s could actually be
different. For more details, see `{epiprocess}`

.

### Why doesn’t this package already exist?

As described above:

Parts actually DO exist. There’s a universe called tidymodels. It handles preprocessing, training, and prediction, bound together, through a package called workflows. We built epipredict on top of that setup. In this way, you CAN use almost everything they provide.

However, workflows doesn’t do postprocessing. And nothing in the -verse handles

*panel data*.The tidy-team doesn’t have plans to do either of these things. (We checked).

There are two packages that do

*time series*built on tidymodels, but it’s “basic” time series: 1-step AR models, exponential smoothing, STL decomposition, etc.^{1}Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.

## Show me the basics

We start with the `jhu`

data displayed above. One of the
“canned” forecasters we provide is an autoregressive forecaster with (or
without) covariates that *directly* trains on the response. This
is in contrast to a typical “iterative” AR model that trains to predict
one-step-ahead, and then plugs in the predictions to “leverage up” to
longer horizons.

We’ll estimate the model jointly across all locations using only the most recent 30 days.

```
jhu <- jhu %>% filter(time_value >= max(time_value) - 30)
out <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate")
)
```

The `out`

object has two components:

- The predictions which is just another
`epi_df`

. It contains the predictions for each location along with additional columns. By default, these are a 90% predictive interval, the`forecast_date`

(the date on which the forecast was putatively made) and the`target_date`

(the date for which the forecast is being made).

```
out$predictions
#> # A tibble: 56 × 5
#> geo_value .pred .pred_distn forecast_date target_date
#> <chr> <dbl> <dist> <date> <date>
#> 1 ak 0.355 quantiles(0.36)[2] 2021-12-31 2022-01-07
#> 2 al 0.325 quantiles(0.32)[2] 2021-12-31 2022-01-07
#> 3 ar 0.496 quantiles(0.5)[2] 2021-12-31 2022-01-07
#> 4 as 0.0836 quantiles(0.2)[2] 2021-12-31 2022-01-07
#> 5 az 0.614 quantiles(0.61)[2] 2021-12-31 2022-01-07
#> 6 ca 0.327 quantiles(0.33)[2] 2021-12-31 2022-01-07
#> 7 co 0.567 quantiles(0.57)[2] 2021-12-31 2022-01-07
#> 8 ct 0.544 quantiles(0.54)[2] 2021-12-31 2022-01-07
#> 9 dc 0.831 quantiles(0.83)[2] 2021-12-31 2022-01-07
#> 10 de 0.607 quantiles(0.61)[2] 2021-12-31 2022-01-07
#> # ℹ 46 more rows
```

- A list object of class
`epi_workflow`

. This object encapsulates all the instructions necessary to create the prediction. More details on this below.

```
out$epi_workflow
#>
#> ══ Epi Workflow [trained] ══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> Postprocessor: Frosting
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#>
#> 6 Recipe steps.
#> 1. step_epi_lag()
#> 2. step_epi_lag()
#> 3. step_epi_ahead()
#> 4. step_naomit()
#> 5. step_naomit()
#> 6. step_training_window()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) lag_0_case_rate lag_7_case_rate lag_14_case_rate
#> 0.0829475 0.0009830 0.0027035 -0.0005651
#> lag_0_death_rate lag_7_death_rate lag_14_death_rate
#> 0.2466110 0.1964921 0.0752998
#>
#> ── Postprocessor ───────────────────────────────────────────────────────────────
#>
#> 5 Frosting layers.
#> 1. layer_predict()
#> 2. layer_residual_quantiles()
#> 3. layer_add_forecast_date()
#> 4. layer_add_target_date()
#> 5. layer_threshold()
#>
```

By default, the forecaster predicts the outcome
(`death_rate`

) 1-week ahead, using 3 lags of each predictor
(`case_rate`

and `death_rate`

) at 0 (today), 1
week back and 2 weeks back. The predictors and outcome can be changed
directly. The rest of the defaults are encapsulated into a list of
arguments. This list is produced by `arx_args_list()`

.

### Simple adjustments

Basic adjustments can be made through the `args_list`

.

```
out2week <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate"),
args_list = arx_args_list(
lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
ahead = 14
)
)
```

Here, we’ve used different lags on the `case_rate`

and are
now predicting 2 weeks ahead. This example also illustrates a major
difficulty with the “iterative” versions of AR models. This model
doesn’t produce forecasts for `case_rate`

, and so, would not
have data to “plug in” for the necessary lags.^{2}

Another property of the basic model is the predictive interval. We describe this in more detail in a different vignette, but it is easy to request multiple quantiles.

```
out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
args_list = arx_args_list(
quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
)
)
```

The column `.pred_dstn`

in the `predictions`

object is actually a “distribution” here parameterized by its quantiles.
For this default forecaster, these are created using the quantiles of
the residuals of the predictive model (possibly symmetrized). Here, we
used 23 quantiles, but one can grab a particular quantile,

```
head(quantile(out_q$predictions$.pred_distn, p = .4))
#> [1] 0.30277798 0.27213225 0.44345734 0.03120647 0.56121844 0.27492711
```

or extract the entire distribution into a “long” `epi_df`

with `quantile_levels`

being the probability and
`values`

being the value associated to that quantile.

```
out_q$predictions %>%
# first create a "nested" list-column
mutate(.pred_distn = nested_quantiles(.pred_distn)) %>%
unnest(.pred_distn) # then unnest it
#> # A tibble: 1,288 × 6
#> geo_value .pred values quantile_levels forecast_date target_date
#> <chr> <dbl> <dbl> <dbl> <date> <date>
#> 1 ak 0.355 0 0.01 2021-12-31 2022-01-07
#> 2 ak 0.355 0 0.025 2021-12-31 2022-01-07
#> 3 ak 0.355 0.0371 0.05 2021-12-31 2022-01-07
#> 4 ak 0.355 0.123 0.1 2021-12-31 2022-01-07
#> 5 ak 0.355 0.174 0.15 2021-12-31 2022-01-07
#> 6 ak 0.355 0.211 0.2 2021-12-31 2022-01-07
#> 7 ak 0.355 0.237 0.25 2021-12-31 2022-01-07
#> 8 ak 0.355 0.260 0.3 2021-12-31 2022-01-07
#> 9 ak 0.355 0.282 0.35 2021-12-31 2022-01-07
#> 10 ak 0.355 0.303 0.4 2021-12-31 2022-01-07
#> # ℹ 1,278 more rows
```

Additional simple adjustments to the basic forecaster can be made using the function:

```
arx_args_list(
lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,
forecast_date = NULL, target_date = NULL, quantile_levels = c(0.05, 0.95),
symmetrize = TRUE, nonneg = TRUE, quantile_by_key = character(0L),
nafill_buffer = Inf
)
```

### Changing the engine

So far, our forecasts have been produced using simple linear
regression. But this is not the only way to estimate such a model. The
`trainer`

argument determines the type of model we want. This
takes a `{parsnip}`

model.
The default is linear regression, but we could instead use a random
forest with the ranger package:

```
out_rf <- arx_forecaster(
jhu, "death_rate", c("case_rate", "death_rate"),
rand_forest(mode = "regression")
)
```

Or boosted regression trees with xgboost:

```
out_gb <- arx_forecaster(
jhu, "death_rate", c("case_rate", "death_rate"),
boost_tree(mode = "regression", trees = 20)
)
```

Or quantile regression, using our custom forecasting engine
`quantile_reg()`

:

```
out_qr <- arx_forecaster(
jhu, "death_rate", c("case_rate", "death_rate"),
quantile_reg()
)
```

FWIW, this last case (using quantile regression), is not far from what the Delphi production forecast team used for its Covid forecasts over the past few years.

### Inner workings

Underneath the hood, this forecaster creates (and returns) an
`epi_workflow`

. Essentially, this is a big S3 object that
wraps up the 4 modular steps (preprocessing - postprocessing) described
above.

#### Preprocessing

Preprocessing is accomplished through a `recipe`

(imagine
baking a cake) as provided in the `{recipes}`

package. We’ve made a few modifications (to handle panel data) as well
as added some additional options. The recipe gives a specification of
how to handle training data. Think of it like a fancified
`formula`

that you would pass to `lm()`

:
`y ~ x1 + log(x2)`

. In general, there are 2 extensions to the
`formula`

that recipes handles:

- Doing transformations of both training and test data that can always be applied. These are things like taking the log of a variable, leading or lagging, filtering out rows, handling dummy variables, etc.
- Using statistics from the training data to eventually process test data. This is a major benefit of recipes. It prevents what the tidy team calls “data leakage”. A simple example is centering a predictor by its mean. We need to store the mean of the predictor from the training data and use that value on the test data rather than accidentally calculating the mean of the test predictor for centering.

A recipe is processed in 2 steps, first it is “prepped”. This
calculates and stores any intermediate statistics necessary for use on
the test data. Then it is “baked” resulting in training data ready for
passing into a statistical model (like `lm`

).

We have introduced an `epi_recipe`

. It’s just a
`recipe`

that knows how to handle the
`time_value`

, `geo_value`

, and any additional keys
so that these are available when necessary.

The `epi_recipe`

from `out_gb`

can be extracted
from the result:

`extract_recipe(out_gb$epi_workflow)`

The “Inputs” are the original `epi_df`

and the “roles”
that these are assigned. None of these are predictors or outcomes. Those
will be created by the recipe when it is prepped. The “Operations” are
the sequence of instructions to create the cake (baked training data).
Here we create lagged predictors, lead the outcome, and then remove
`NA`

s. Some models like `lm`

internally handle
`NA`

s, but not everything does, so we deal with them
explicitly. The code to do this (inside the forecaster) is

```
er <- epi_recipe(jhu) %>%
step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
step_epi_ahead(death_rate, ahead = 7) %>%
step_epi_naomit()
```

While recipes provides a function
`step_lag()`

, it assumes that the data have no breaks in the
sequence of `time_values`

. This is a bit dangerous, so we
avoid that behaviour. Our `lag/ahead`

functions also
appropriately adjust the amount of data to avoid accidentally dropping
recent predictors from the test data.

#### The model specification

Users with familiarity with the parsnip package will
have no trouble here. Basically, parsnip unifies the
function signature across statistical models. For example,
`lm()`

“likes” to work with formulas, but
`glmnet::glmnet()`

uses `x`

and `y`

for
predictors and response. parsnip is agnostic. Both of
these do “linear regression”. Above we switched from `lm()`

to `xgboost()`

without any issue despite the fact that these
functions couldn’t be more different.

```
lm(formula, data, subset, weights, na.action,
method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...
)
xgboost(
data = NULL, label = NULL, missing = NA, weight = NULL,
params = list(), nrounds, verbose = 1, print_every_n = 1L,
early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
save_name = "xgboost.model", xgb_model = NULL, callbacks = list(),
...
)
```

epipredict provides a few engines/modules (the flatline forecaster and quantile regression), but you should be able to use any available models listed here.

To estimate (fit) a preprocessed model, one calls `fit()`

on the `epi_workflow`

.

`ewf <- epi_workflow(er, linear_reg()) %>% fit(jhu)`

#### Postprocessing

To stretch the metaphor of preparing a cake to its natural limits, we
have created postprocessing functionality called “frosting”. Much like
the recipe, each postprocessing operation is a “layer” and we “slather”
these onto our baked cake. To fix ideas, below is the postprocessing
`frosting`

for `arx_forecaster()`

`extract_frosting(out_q$epi_workflow)`

Here we have 5 layers of frosting. The first generates the forecasts from the test data. The second uses quantiles of the residuals to create distributional forecasts. The next two add columns for the date the forecast was made and the date for which it is intended to occur. Because we are predicting rates, they should be non-negative, so the last layer thresholds both predicted values and intervals at 0. The code to do this (inside the forecaster) is

```
f <- frosting() %>%
layer_predict() %>%
layer_residual_quantiles(
quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99),
symmetrize = TRUE
) %>%
layer_add_forecast_date() %>%
layer_add_target_date() %>%
layer_threshold(starts_with(".pred"))
```

At predict time, we add this object onto the
`epi_workflow`

and call `forecast()`

```
ewf %>%
add_frosting(f) %>%
forecast()
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2022-05-31 19:08:25.791826
#>
#> # A tibble: 56 × 6
#> geo_value time_value .pred .pred_distn forecast_date target_date
#> * <chr> <date> <dbl> <dist> <date> <date>
#> 1 ak 2021-12-31 0.355 quantiles(0.36)[23] 2021-12-31 2022-01-07
#> 2 al 2021-12-31 0.325 quantiles(0.32)[23] 2021-12-31 2022-01-07
#> 3 ar 2021-12-31 0.496 quantiles(0.5)[23] 2021-12-31 2022-01-07
#> 4 as 2021-12-31 0.0836 quantiles(0.08)[23] 2021-12-31 2022-01-07
#> 5 az 2021-12-31 0.614 quantiles(0.61)[23] 2021-12-31 2022-01-07
#> 6 ca 2021-12-31 0.327 quantiles(0.33)[23] 2021-12-31 2022-01-07
#> 7 co 2021-12-31 0.567 quantiles(0.57)[23] 2021-12-31 2022-01-07
#> 8 ct 2021-12-31 0.544 quantiles(0.54)[23] 2021-12-31 2022-01-07
#> 9 dc 2021-12-31 0.831 quantiles(0.83)[23] 2021-12-31 2022-01-07
#> 10 de 2021-12-31 0.607 quantiles(0.61)[23] 2021-12-31 2022-01-07
#> # ℹ 46 more rows
```

The above `get_test_data()`

function examines the recipe
and ensures that enough test data is available to create the necessary
lags and produce a prediction for the desired future time point (after
the end of the training data). This mimics what would happen if
`jhu`

contained the most recent available historical data and
we wanted to actually predict the future. We could have instead used any
test data that contained the necessary predictors.

### Conclusion

Internally, we provide some simple functions to create reasonable forecasts. But ideally, a user could create their own forecasters by building up the components we provide. In other vignettes, we try to walk through some of these customizations.

To illustrate everything above, here is (roughly) the code for the
`flatline_forecaster()`

applied to the
`case_rate`

.

```
r <- epi_recipe(jhu) %>%
step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>%
update_role(case_rate, new_role = "predictor") %>%
add_role(all_of(key_colnames(jhu)), new_role = "predictor")
f <- frosting() %>%
layer_predict() %>%
layer_residual_quantiles() %>%
layer_add_forecast_date() %>%
layer_add_target_date() %>%
layer_threshold(starts_with(".pred"))
eng <- linear_reg() %>% set_engine("flatline")
wf <- epi_workflow(r, eng, f) %>% fit(jhu)
preds <- forecast(wf)
```

All that really differs from the `arx_forecaster()`

is the
`recipe`

, the test data, and the engine. The
`frosting`

is identical, as is the fitting and predicting
procedure.

```
preds
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2022-05-31 19:08:25.791826
#>
#> # A tibble: 56 × 6
#> geo_value time_value .pred .pred_distn forecast_date target_date
#> * <chr> <date> <dbl> <dist> <date> <date>
#> 1 ak 2021-12-31 36.4 quantiles(39.58)[2] 2021-12-31 2022-01-07
#> 2 al 2021-12-31 89.9 quantiles(89.92)[2] 2021-12-31 2022-01-07
#> 3 ar 2021-12-31 82.6 quantiles(82.58)[2] 2021-12-31 2022-01-07
#> 4 as 2021-12-31 0 quantiles(21.39)[2] 2021-12-31 2022-01-07
#> 5 az 2021-12-31 58.3 quantiles(58.28)[2] 2021-12-31 2022-01-07
#> 6 ca 2021-12-31 84.4 quantiles(84.37)[2] 2021-12-31 2022-01-07
#> 7 co 2021-12-31 106. quantiles(105.83)[2] 2021-12-31 2022-01-07
#> 8 ct 2021-12-31 143. quantiles(143.1)[2] 2021-12-31 2022-01-07
#> 9 dc 2021-12-31 295. quantiles(295.03)[2] 2021-12-31 2022-01-07
#> 10 de 2021-12-31 150. quantiles(149.93)[2] 2021-12-31 2022-01-07
#> # ℹ 46 more rows
```