Goals for the package
At a high level, our goal with epipredict is to make running simple Machine Learning / Statistical forecasters for epidemiology easy. However, this package is extremely extensible, and that is part of its utility. Our hope is that it is easy for users with epi training and some statistics to fit baseline models while still allowing those with more nuanced statistical understanding to create complicated specializations using the same framework.
Serving both populations is the main motivation for our efforts, but at the same time, we have tried hard to make it useful.
Baseline models
We provide a set of basic, easy-to-use forecasters that work out of the box. You should be able to do a reasonably limited amount of customization on them. Any serious customization happens with the framework discussed below).
For the basic forecasters, we provide:
- Baseline flat-line forecaster
- Autoregressive forecaster
- Autoregressive classifier
All the forcasters we provide are built on our framework. So we will use these basic models to illustrate its flexibility.
Forecasting framework
Our framework for creating custom forecasters views the prediction task as a set of modular components. There are four types of components:
- Preprocessor: make transformations to the data before model training
- Trainer: train a model on data, resulting in a fitted model object
- Predictor: make predictions, using a fitted model object and processed test data
- Postprocessor: manipulate or transform the predictions before returning
Users familiar with {tidymodels}
and
especially the {workflows}
package will notice a lot of overlap. This is by design, and is in fact
a feature. The truth is that epipredict is a wrapper
around much that is contained in these packages. Therefore, if you want
something from this -verse, it should “just work” (we hope).
The reason for the overlap is that workflows already implements the first three steps. And it does this very well. However, it is missing the postprocessing stage and currently has no plans for such an implementation. And this feature is important. The baseline forecaster we provide requires postprocessing. Anything more complicated needs this as well.
The second omission from tidymodels is support for panel data. Besides epidemiological data, economics, psychology, sociology, and many other areas frequently deal with data of this type. So the framework of behind epipredict implements this. In principle, this has nothing to do with epidemiology, and one could simply use this package as a solution for the missing functionality in tidymodels. Again, this should “just work”.
All of the panel data functionality is implemented through
the epi_df
data type in the companion {epiprocess}
package. There is much more to see there, but for the moment, it’s
enough to look at a simple one:
jhu <- covid_case_death_rates
jhu
#> An `epi_df` object, 20,496 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2022-05-31
#>
#> # A tibble: 20,496 × 4
#> geo_value time_value case_rate death_rate
#> * <chr> <date> <dbl> <dbl>
#> 1 ak 2020-12-31 35.9 0.158
#> 2 al 2020-12-31 65.1 0.438
#> 3 ar 2020-12-31 66.0 1.27
#> 4 as 2020-12-31 0 0
#> 5 az 2020-12-31 76.8 1.10
#> 6 ca 2020-12-31 96.0 0.751
#> # ℹ 20,490 more rows
This data is built into the package and contains the measured
variables case_rate
and death_rate
for
COVID-19 at the daily level for each US state for the year 2021. The
“panel” part is because we have repeated measurements across a number of
locations.
The epi_df
encodes the time stamp as
time_value
and the key
as
geo_value
. While these 2 names are required, the values
don’t need to actually represent such objects. Additional
key
’s are also supported (like age group, ethnicity,
taxonomy, etc.).
The epi_df
also contains some metadata that describes
the keys as well as the vintage of the data. It’s possible that data
collected at different times for the same set of
geo_value
’s and time_value
’s could actually be
different. For more details, see {epiprocess}
.
Why doesn’t this package already exist?
As described above:
Parts actually DO exist. There’s a universe called tidymodels. It handles preprocessing, training, and prediction, bound together, through a package called workflows. We built epipredict on top of that setup. In this way, you CAN use almost everything they provide.
However, workflows doesn’t do postprocessing. And nothing in the -verse handles panel data.
The tidy-team doesn’t have plans to do either of these things. (We checked).
There are two packages that do time series built on tidymodels, but it’s “basic” time series: 1-step AR models, exponential smoothing, STL decomposition, etc.1 Our group has not prioritized these sorts of models for epidemic forecasting, but one could also integrate these methods into our framework.
Show me the basics
We start with the jhu
data displayed above. One of the
“canned” forecasters we provide is an autoregressive forecaster with (or
without) covariates that directly trains on the response. This
is in contrast to a typical “iterative” AR model that trains to predict
one-step-ahead, and then plugs in the predictions to “leverage up” to
longer horizons.
We’ll estimate the model jointly across all locations using only the most recent 30 days.
jhu <- jhu %>% filter(time_value >= max(time_value) - 30)
out <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate")
)
The out
object has two components:
- The predictions which is just another
epi_df
. It contains the predictions for each location along with additional columns. By default, these are a 90% predictive interval, theforecast_date
(the date on which the forecast was putatively made) and thetarget_date
(the date for which the forecast is being made).
out$predictions
#> # A tibble: 56 × 5
#> geo_value .pred .pred_distn forecast_date target_date
#> <chr> <dbl> <dist> <date> <date>
#> 1 ak 0.355 quantiles(0.36)[2] 2021-12-31 2022-01-07
#> 2 al 0.325 quantiles(0.32)[2] 2021-12-31 2022-01-07
#> 3 ar 0.496 quantiles(0.5)[2] 2021-12-31 2022-01-07
#> 4 as 0.0836 quantiles(0.2)[2] 2021-12-31 2022-01-07
#> 5 az 0.614 quantiles(0.61)[2] 2021-12-31 2022-01-07
#> 6 ca 0.327 quantiles(0.33)[2] 2021-12-31 2022-01-07
#> # ℹ 50 more rows
- A list object of class
epi_workflow
. This object encapsulates all the instructions necessary to create the prediction. More details on this below.
out$epi_workflow
#>
#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> Postprocessor: Frosting
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#>
#> 6 Recipe steps.
#> 1. step_epi_lag()
#> 2. step_epi_lag()
#> 3. step_epi_ahead()
#> 4. step_naomit()
#> 5. step_naomit()
#> 6. step_training_window()
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) lag_0_case_rate lag_7_case_rate lag_14_case_rate
#> 0.0829475 0.0009830 0.0027035 -0.0005651
#> lag_0_death_rate lag_7_death_rate lag_14_death_rate
#> 0.2466110 0.1964921 0.0752998
#>
#> ── Postprocessor ────────────────────────────────────────────────────────────
#>
#> 5 Frosting layers.
#> 1. layer_predict()
#> 2. layer_residual_quantiles()
#> 3. layer_add_forecast_date()
#> 4. layer_add_target_date()
#> 5. layer_threshold()
#>
By default, the forecaster predicts the outcome
(death_rate
) 1-week ahead, using 3 lags of each predictor
(case_rate
and death_rate
) at 0 (today), 1
week back and 2 weeks back. The predictors and outcome can be changed
directly. The rest of the defaults are encapsulated into a list of
arguments. This list is produced by arx_args_list()
.
Simple adjustments
Basic adjustments can be made through the args_list
.
out2week <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate"),
args_list = arx_args_list(
lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
ahead = 14
)
)
Here, we’ve used different lags on the case_rate
and are
now predicting 2 weeks ahead. This example also illustrates a major
difficulty with the “iterative” versions of AR models. This model
doesn’t produce forecasts for case_rate
, and so, would not
have data to “plug in” for the necessary lags.2
Another property of the basic model is the predictive interval. We describe this in more detail in a different vignette, but it is easy to request multiple quantiles.
out_q <- arx_forecaster(jhu, "death_rate", c("case_rate", "death_rate"),
args_list = arx_args_list(
quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99)
)
)
The column .pred_dstn
in the predictions
object is actually a “distribution” here parameterized by its quantiles.
For this default forecaster, these are created using the quantiles of
the residuals of the predictive model (possibly symmetrized). Here, we
used 23 quantiles, but one can grab a particular quantile,
round(head(quantile(out_q$predictions$.pred_distn, p = .4)), 3)
#> [1] 0.303 0.272 0.443 0.031 0.561 0.275
or extract the entire distribution into a “long” epi_df
with quantile_levels
being the probability and
values
being the value associated to that quantile.
out_q$predictions %>%
pivot_quantiles_longer(.pred_distn)
#> # A tibble: 1,288 × 6
#> geo_value .pred values quantile_levels forecast_date target_date
#> <chr> <dbl> <dbl> <dbl> <date> <date>
#> 1 ak 0.355 0 0.01 2021-12-31 2022-01-07
#> 2 ak 0.355 0 0.025 2021-12-31 2022-01-07
#> 3 ak 0.355 0.0371 0.05 2021-12-31 2022-01-07
#> 4 ak 0.355 0.123 0.1 2021-12-31 2022-01-07
#> 5 ak 0.355 0.174 0.15 2021-12-31 2022-01-07
#> 6 ak 0.355 0.211 0.2 2021-12-31 2022-01-07
#> # ℹ 1,282 more rows
Additional simple adjustments to the basic forecaster can be made using the function:
arx_args_list(
lags = c(0L, 7L, 14L), ahead = 7L, n_training = Inf,
forecast_date = NULL, target_date = NULL, quantile_levels = c(0.05, 0.95),
symmetrize = TRUE, nonneg = TRUE, quantile_by_key = character(0L),
nafill_buffer = Inf
)
Changing the engine
So far, our forecasts have been produced using simple linear
regression. But this is not the only way to estimate such a model. The
trainer
argument determines the type of model we want. This
takes a {parsnip}
model.
The default is linear regression, but we could instead use a random
forest with the ranger package:
out_rf <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate"),
trainer = rand_forest(mode = "regression")
)
Or boosted regression trees with xgboost:
out_gb <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate"),
trainer = boost_tree(mode = "regression", trees = 20)
)
Or quantile regression, using our custom forecasting engine
quantile_reg()
:
out_qr <- arx_forecaster(
jhu,
outcome = "death_rate",
predictors = c("case_rate", "death_rate"),
trainer = quantile_reg()
)
FWIW, this last case (using quantile regression), is not far from what the Delphi production forecast team used for its Covid forecasts over the past few years.
Inner workings
Underneath the hood, this forecaster creates (and returns) an
epi_workflow
. Essentially, this is a big S3 object that
wraps up the 4 modular steps (preprocessing - postprocessing) described
above.
Preprocessing
Preprocessing is accomplished through a recipe
(imagine
baking a cake) as provided in the {recipes}
package. We’ve made a few modifications (to handle panel data) as well
as added some additional options. The recipe gives a specification of
how to handle training data. Think of it like a fancified
formula
that you would pass to lm()
:
y ~ x1 + log(x2)
. In general, there are 2 extensions to the
formula
that recipes handles:
- Doing transformations of both training and test data that can always be applied. These are things like taking the log of a variable, leading or lagging, filtering out rows, handling dummy variables, etc.
- Using statistics from the training data to eventually process test data. This is a major benefit of recipes. It prevents what the tidy team calls “data leakage”. A simple example is centering a predictor by its mean. We need to store the mean of the predictor from the training data and use that value on the test data rather than accidentally calculating the mean of the test predictor for centering.
A recipe is processed in 2 steps, first it is “prepped”. This
calculates and stores any intermediate statistics necessary for use on
the test data. Then it is “baked” resulting in training data ready for
passing into a statistical model (like lm
).
We have introduced an epi_recipe
. It’s just a
recipe
that knows how to handle the
time_value
, geo_value
, and any additional keys
so that these are available when necessary.
The epi_recipe
from out_gb
can be extracted
from the result:
extract_recipe(out_gb$epi_workflow)
The “Inputs” are the original epi_df
and the “roles”
that these are assigned. None of these are predictors or outcomes. Those
will be created by the recipe when it is prepped. The “Operations” are
the sequence of instructions to create the cake (baked training data).
Here we create lagged predictors, lead the outcome, and then remove
NA
s. Some models like lm
internally handle
NA
s, but not everything does, so we deal with them
explicitly. The code to do this (inside the forecaster) is
er <- epi_recipe(jhu) %>%
step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
step_epi_ahead(death_rate, ahead = 7) %>%
step_epi_naomit()
While recipes provides a function
step_lag()
, it assumes that the data have no breaks in the
sequence of time_values
. This is a bit dangerous, so we
avoid that behaviour. Our lag/ahead
functions also
appropriately adjust the amount of data to avoid accidentally dropping
recent predictors from the test data.
The model specification
Users with familiarity with the parsnip package will
have no trouble here. Basically, parsnip unifies the
function signature across statistical models. For example,
lm()
“likes” to work with formulas, but
glmnet::glmnet()
uses x
and y
for
predictors and response. parsnip is agnostic. Both of
these do “linear regression”. Above we switched from lm()
to xgboost()
without any issue despite the fact that these
functions couldn’t be more different.
lm(formula, data, subset, weights, na.action,
method = "qr",
model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
contrasts = NULL, offset, ...
)
xgboost(
data = NULL, label = NULL, missing = NA, weight = NULL,
params = list(), nrounds, verbose = 1, print_every_n = 1L,
early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
save_name = "xgboost.model", xgb_model = NULL, callbacks = list(),
...
)
epipredict provides a few engines/modules (the flatline forecaster and quantile regression), but you should be able to use any available models listed here.
To estimate (fit) a preprocessed model, one calls fit()
on the epi_workflow
.
ewf <- epi_workflow(er, linear_reg()) %>% fit(jhu)
Postprocessing
To stretch the metaphor of preparing a cake to its natural limits, we
have created postprocessing functionality called “frosting”. Much like
the recipe, each postprocessing operation is a “layer” and we “slather”
these onto our baked cake. To fix ideas, below is the postprocessing
frosting
for arx_forecaster()
extract_frosting(out_q$epi_workflow)
Here we have 5 layers of frosting. The first generates the forecasts from the test data. The second uses quantiles of the residuals to create distributional forecasts. The next two add columns for the date the forecast was made and the date for which it is intended to occur. Because we are predicting rates, they should be non-negative, so the last layer thresholds both predicted values and intervals at 0. The code to do this (inside the forecaster) is
f <- frosting() %>%
layer_predict() %>%
layer_residual_quantiles(
quantile_levels = c(.01, .025, seq(.05, .95, by = .05), .975, .99),
symmetrize = TRUE
) %>%
layer_add_forecast_date() %>%
layer_add_target_date() %>%
layer_threshold(starts_with(".pred"))
At predict time, we add this object onto the
epi_workflow
and call forecast()
ewf %>%
add_frosting(f) %>%
forecast()
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * other_keys = geo_value, time_value
#> * as_of = 2022-05-31
#>
#> # A tibble: 56 × 6
#> geo_value time_value .pred .pred_distn forecast_date target_date
#> * <chr> <date> <dbl> <dist> <date> <date>
#> 1 ak 2021-12-31 0.355 quantiles(0.36)[23] 2021-12-31 2022-01-07
#> 2 al 2021-12-31 0.325 quantiles(0.32)[23] 2021-12-31 2022-01-07
#> 3 ar 2021-12-31 0.496 quantiles(0.5)[23] 2021-12-31 2022-01-07
#> 4 as 2021-12-31 0.0836 quantiles(0.08)[23] 2021-12-31 2022-01-07
#> 5 az 2021-12-31 0.614 quantiles(0.61)[23] 2021-12-31 2022-01-07
#> 6 ca 2021-12-31 0.327 quantiles(0.33)[23] 2021-12-31 2022-01-07
#> # ℹ 50 more rows
The above get_test_data()
function examines the recipe
and ensures that enough test data is available to create the necessary
lags and produce a prediction for the desired future time point (after
the end of the training data). This mimics what would happen if
jhu
contained the most recent available historical data and
we wanted to actually predict the future. We could have instead used any
test data that contained the necessary predictors.
Conclusion
Internally, we provide some simple functions to create reasonable forecasts. But ideally, a user could create their own forecasters by building up the components we provide. In other vignettes, we try to walk through some of these customizations.
To illustrate everything above, here is (roughly) the code for the
flatline_forecaster()
applied to the
case_rate
.
r <- epi_recipe(jhu) %>%
step_epi_ahead(case_rate, ahead = 7, skip = TRUE) %>%
update_role(case_rate, new_role = "predictor") %>%
add_role(all_of(key_colnames(jhu)), new_role = "predictor")
f <- frosting() %>%
layer_predict() %>%
layer_residual_quantiles() %>%
layer_add_forecast_date() %>%
layer_add_target_date() %>%
layer_threshold(starts_with(".pred"))
eng <- linear_reg() %>% set_engine("flatline")
wf <- epi_workflow(r, eng, f) %>% fit(jhu)
preds <- forecast(wf)
All that really differs from the arx_forecaster()
is the
recipe
, the test data, and the engine. The
frosting
is identical, as is the fitting and predicting
procedure.
preds
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * other_keys = geo_value, time_value
#> * as_of = 2022-05-31
#>
#> # A tibble: 56 × 6
#> geo_value time_value .pred .pred_distn forecast_date target_date
#> * <chr> <date> <dbl> <dist> <date> <date>
#> 1 ak 2021-12-31 36.4 quantiles(39.58)[2] 2021-12-31 2022-01-07
#> 2 al 2021-12-31 89.9 quantiles(89.92)[2] 2021-12-31 2022-01-07
#> 3 ar 2021-12-31 82.6 quantiles(82.58)[2] 2021-12-31 2022-01-07
#> 4 as 2021-12-31 0 quantiles(21.39)[2] 2021-12-31 2022-01-07
#> 5 az 2021-12-31 58.3 quantiles(58.28)[2] 2021-12-31 2022-01-07
#> 6 ca 2021-12-31 84.4 quantiles(84.37)[2] 2021-12-31 2022-01-07
#> # ℹ 50 more rows