MINI-PROJECT 3: Customizing models

Author

Delphi + MICOM

Published

April 27, 2026

Load packages

The {InsightNetApr26} package ensures all required Delphi tools and their correct versions/branches are installed.

if (!requireNamespace("pak", quietly = TRUE)) install.packages("pak")
if (!requireNamespace("InsightNetApr26", quietly = TRUE)) {
  pak::pkg_install("cmu-delphi/InsightNet-apr-2026/InsightNetApr26")
}
InsightNetApr26::verify_setup()

# If pak demands Rtools and you don't have it, you can use this instead:
#
# if (!requireNamespace("remotes", quietly = TRUE)) {
#   install.packages("remotes")
# }
# remotes::install_github("cmu-delphi/InsightNet-apr-2026/InsightNetApr26")
# remotes::install_github("cmu-delphi/epidatr")
# remotes::install_github("cmu-delphi/epidatasets")
# remotes::install_github("cmu-delphi/epiprocess")
# remotes::install_github("cmu-delphi/epipredict")

library(epidatr)
library(epiprocess)
library(epipredict)
library(dplyr)
library(ggplot2)
library(parsnip)

3.1 Basic Modelling

⭐ We will use the built-in covid_case_death_rates dataset provided by the epidatasets package, which automatically loads with the other Delphi tools.

Hint

You can get a partial view of the dataset by entering covid_case_death_rates in the R console.

Filter out the observations on or after August 1, 2021 (the forecast date) to create a training set.

# We define our forecast date and filter the data to only include
# what was available at that time.
forecast_date <- as.Date("2021-08-01")

training_set <- covid_case_death_rates |>
  # pretend we have one day of latency:
  filter(time_value <= forecast_date - 1L)

In a real-world scenario, reporting delays mean that recent data is often missing or subject to revision. If you accidentally trained your model on data that wasn’t actually available at the time of the forecast, how would this “data leakage” give you a misleading impression of your model’s accuracy?

Note

This is why, for serious applications, we want to use epi_archives and epix_as_of or epix_slide rather than the filtering approach above, in order to faithfully reflect the data that would have been fed into a model and how the model would have behaved.
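
To make the note above concrete, here is a minimal base-R sketch of what an "as of" snapshot means. The revision table, its numbers, and the helper snapshot_as_of() are hypothetical and only illustrate the idea; with real Delphi data you would build an epi_archive and call epix_as_of() rather than roll your own.

```r
# Toy revision history (hypothetical numbers): each row is one reported value
# for a date (`time_value`) as published in a given `version`.
revisions <- data.frame(
  time_value = as.Date(c("2021-07-30", "2021-07-30", "2021-07-31", "2021-07-31")),
  version    = as.Date(c("2021-07-31", "2021-08-05", "2021-08-01", "2021-08-06")),
  death_rate = c(0.10, 0.14, 0.08, 0.15)
)

# For each time_value, return the latest value published on or before `as_of`.
# (Hypothetical helper, not part of any Delphi package.)
snapshot_as_of <- function(revisions, as_of) {
  known <- revisions[revisions$version <= as_of, ]
  known <- known[order(known$time_value, known$version), ]
  # keep only the most recent version of each time_value
  known[!duplicated(known$time_value, fromLast = TRUE), ]
}

at_forecast_time <- snapshot_as_of(revisions, as.Date("2021-08-01"))
finalized        <- snapshot_as_of(revisions, as.Date("2021-09-01"))
```

In this toy table the values visible on the forecast date differ from the later revisions. Training on finalized instead of at_forecast_time would feed the model numbers that nobody had on August 1, which is the leakage the question above asks about.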

Filter to four states of your choosing and use autoplot() to explore the death_rate and case_rate columns.

# Pick four states and visualize the trends
target_states <- c("ca", "fl", "ny", "pa")

autoplot(training_set, death_rate, case_rate, .facet_filter = geo_value %in% target_states)

Let’s fit an autoregressive model using lagged values of the death_rate column as features with arx_forecaster() on our training set.

Hint

You need to provide the dataset and the outcome argument to arx_forecaster()

# Fit a default AR model
fcst_ar <- arx_forecaster(
  training_set,
  outcome = "death_rate"
)

fcst_ar

When you run a forecaster, it returns an object containing several useful components.
- Use autoplot(your_forecast) for a quick plot.
- Use your_forecast$predictions to view the results, and pivot_quantiles_wider() to expand the quantile distribution column into individual columns.
- Use your_forecast$epi_workflow to access the underlying model. To see exactly how much weight the model gave to each lag, pass this object to hardhat::extract_fit_engine().

# Quick plot
autoplot(fcst_ar, .facet_filter = geo_value %in% target_states)

# Inspect predictions and pivot quantiles
fcst_ar$predictions |>
  pivot_quantiles_wider(.pred_distn) |>
  head()

# A tibble: 6 × 11
  geo_value  .pred forecast_date target_date `0.05`  `0.1` `0.25`  `0.5` `0.75`
  <chr>      <dbl> <date>        <date>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 ak        0.0916 2021-07-31    2021-08-07  0      0      0.0463 0.0916 0.137 
2 al        0.136  2021-07-31    2021-08-07  0      0.0227 0.0906 0.136  0.181 
3 ar        0.286  2021-07-31    2021-08-07  0.0814 0.173  0.240  0.286  0.331 
4 as        0.0366 2021-07-31    2021-08-07  0      0      0      0.0366 0.0819
5 az        0.140  2021-07-31    2021-08-07  0      0.0265 0.0943 0.140  0.185 
6 ca        0.0860 2021-07-31    2021-08-07  0      0      0.0406 0.0860 0.131 
# ℹ 2 more variables: `0.9` <dbl>, `0.95` <dbl>

# Access model weights
hardhat::extract_fit_engine(fcst_ar$epi_workflow)


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
      (Intercept)   lag_0_death_rate   lag_7_death_rate  lag_14_death_rate  
          0.03658            0.37151            0.20792            0.15187

How heavily does the model weight data from exactly one week ago (lag 7) compared to the most recent observation (lag 0)? Examine the fitted coefficients returned by hardhat::extract_fit_engine(your_forecast$epi_workflow). Which lag contributes the most to the forecast?
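
As a rough worked example of how these coefficients turn into a point prediction, the sketch below plugs them into the linear formula by hand. The coefficients are copied from the printed fit above; the three lagged death rates are hypothetical and not taken from the dataset, and any post-processing the forecaster applies to its predictions is ignored.

```r
# Coefficients copied from the fitted model output above.
coefs <- c(
  intercept = 0.03658,
  lag_0     = 0.37151,
  lag_7     = 0.20792,
  lag_14    = 0.15187
)

# Hypothetical death rates for one location (NOT taken from the dataset).
x <- c(lag_0 = 0.10, lag_7 = 0.08, lag_14 = 0.06)

# Linear prediction: intercept + sum of (coefficient * lagged value).
pred <- coefs[["intercept"]] + sum(coefs[names(x)] * x)

# Weight on the 7-day lag relative to the most recent value.
ratio_7_vs_0 <- coefs[["lag_7"]] / coefs[["lag_0"]]

pred
ratio_7_vs_0
```

The ratio gives a direct answer to the first question: it is the weight on the one-week-old value expressed as a fraction of the weight on the most recent value.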

⚔️ Compare your autoregressive model fitted with arx_forecaster() to a simple model. The flatline forecaster is a common baseline that assumes the future will be exactly the last observed value, with uncertainty that grows as the forecast horizon grows. Fit a flatline model using flatline_forecaster() and examine the predictions to see how the quantiles behave.

# Fit a flatline baseline model
fcst_flat <- flatline_forecaster(
  training_set,
  outcome = "death_rate"
)

fcst_flat

# Inspect the quantiles around the flat point forecast
fcst_flat$predictions |>
  pivot_quantiles_wider(.pred_distn)

# A tibble: 56 × 11
   geo_value  .pred forecast_date target_date `0.05`   `0.1` `0.25`  `0.5`
   <chr>      <dbl> <date>        <date>       <dbl>   <dbl>  <dbl>  <dbl>
 1 ak        0.0988 2021-07-31    2021-08-07   0     0       0.0500 0.0988
 2 al        0.154  2021-07-31    2021-08-07   0     0.00908 0.105  0.154 
 3 ar        0.438  2021-07-31    2021-08-07   0.188 0.294   0.390  0.438 
 4 as        0      2021-07-31    2021-08-07   0     0       0      0     
 5 az        0.146  2021-07-31    2021-08-07   0     0.00153 0.0975 0.146 
 6 ca        0.0697 2021-07-31    2021-08-07   0     0       0.0209 0.0697
 7 co        0.130  2021-07-31    2021-08-07   0     0       0.0816 0.130 
 8 ct        0.0281 2021-07-31    2021-08-07   0     0       0      0.0281
 9 dc        0.0601 2021-07-31    2021-08-07   0     0       0.0113 0.0601
10 de        1.95   2021-07-31    2021-08-07   1.70  1.81    1.91   1.95  
# ℹ 46 more rows
# ℹ 3 more variables: `0.75` <dbl>, `0.9` <dbl>, `0.95` <dbl>
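
The flatline idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the toy series is made up, flatline_quantiles() is a hypothetical helper, and pooling symmetrized past errors and flooring at zero are simplifying assumptions about how the intervals are built.

```r
# Hypothetical helper: flatline point forecast plus residual-based quantiles.
flatline_quantiles <- function(y, ahead, probs = c(0.05, 0.5, 0.95)) {
  n <- length(y)
  # Point forecast: carry the last observed value forward.
  point <- y[n]
  # Errors the flatline rule would have made `ahead` days out in the past.
  resid <- y[(ahead + 1):n] - y[1:(n - ahead)]
  # Symmetrize so the interval is centred on the point forecast,
  # then floor at zero because rates cannot be negative.
  pmax(point + quantile(c(resid, -resid), probs, names = FALSE), 0)
}

# Toy daily series for one location (hypothetical values).
y <- 0.1 + 0.02 * sin(seq_len(30) / 3)

q1 <- flatline_quantiles(y, ahead = 1)
q7 <- flatline_quantiles(y, ahead = 7)

# Width of the 90% interval at each horizon.
width1 <- q1[3] - q1[1]
width7 <- q7[3] - q7[1]
```

On this toy series the one-day-ahead interval is narrower than the seven-day-ahead one, because past values changed less over one day than over seven. The median quantile equals the point forecast, mirroring how .pred and the `0.5` column agree in the output above.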

3.2 Model Customization

⭐ While “canned” forecasters such as arx_forecaster() are useful for establishing baselines, you will often want to customize your models. Use the arx_forecaster() arguments to customize your model.

So far, we have only used the outcome itself as a predictor. However, many signals serve as leading indicators. For example, changes in case rates often precede changes in death rates. Use both case_rate and death_rate as predictors of death_rate in arx_forecaster(). [Hint: use the predictors argument.]

# Use multiple signals to improve the forecast
fcst_multi <- arx_forecaster(
  training_set,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate")
)

fcst_multi

autoplot(fcst_multi, .facet_filter = geo_value %in% target_states)

By default, arx_forecaster() uses linear regression. Swap this for quantile_reg().

# Use Quantile Regression for better uncertainty estimation
fcst_qr <- arx_forecaster(
  training_set,
  outcome = "death_rate",
  trainer = quantile_reg()
)

fcst_qr

autoplot(fcst_qr, .facet_filter = geo_value %in% target_states)
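
Quantile regression fits each quantile by minimizing the pinball (check) loss instead of squared error. The base-R sketch below shows that loss on a toy vector; pinball_loss() and the data are hypothetical, and the grid search over constants stands in for the regression fit that quantile_reg() would perform.

```r
# Pinball (check) loss for predicting the quantile at level `tau` with value `q`.
pinball_loss <- function(y, q, tau) {
  err <- y - q
  mean(pmax(tau * err, (tau - 1) * err))
}

y <- 1:99              # toy observations (hypothetical)
tau <- 0.9             # target quantile level
candidates <- seq(1, 99, by = 1)

# Evaluate the loss for each candidate constant prediction.
losses <- vapply(candidates, \(q) pinball_loss(y, q, tau), numeric(1))

# The candidate with the smallest loss is an empirical 90% quantile of y.
best_q <- candidates[which.min(losses)]
best_q
```

Because under-prediction is penalized nine times as heavily as over-prediction when tau = 0.9, the minimizer sits high in the distribution rather than at the mean.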

Further customize your model by modifying the arx_args_list() arguments via the args_list parameter in arx_forecaster().

Hint

You can access the arx_args_list() documentation by typing ?arx_args_list in R.

Use the lags argument in arx_args_list() to provide a vector of integers giving the lags, in days, to use for each indicator.

Hint

lags = c(0, 7, 14) tells the model to use data from today, one week ago, and two weeks ago as predictors. If you are using multiple indicators as predictors, you can provide a list of vectors (e.g., lags = list(c(0, 7), c(0, 14))) where each vector corresponds to one of your indicators.

Set the ahead argument to 28.

# Customize lags and horizon
custom_args <- arx_args_list(
  lags = list(c(0, 7, 14), c(0, 7, 14)), # Lags for case_rate and death_rate
  ahead = 28
)

fcst_custom <- arx_forecaster(
  training_set,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = custom_args
)

fcst_custom

autoplot(fcst_custom, .facet_filter = geo_value %in% target_states)
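
To see what lags and ahead mean for the training data, the base-R sketch below builds the lagged feature columns and the shifted target by hand for a toy series. The helpers lag_by() and lead_by() are hypothetical, and ahead = 3 is used only to keep the toy example small; the forecaster constructs the equivalent columns internally.

```r
# Shift a vector back by k days (lag) or forward by k days (lead), padding with NA.
# (Hypothetical helpers for illustration.)
lag_by  <- function(x, k) c(rep(NA, k), head(x, length(x) - k))
lead_by <- function(x, k) c(tail(x, length(x) - k), rep(NA, k))

y <- 1:20              # toy series: the value equals the day index
lags <- c(0, 7, 14)
ahead <- 3

# One feature column per lag, plus the outcome `ahead` days in the future.
features <- as.data.frame(
  setNames(lapply(lags, \(k) lag_by(y, k)), paste0("lag_", lags))
)
features$target <- lead_by(y, ahead)

# Only rows where every lag and the target are available can be used for training.
train_rows <- features[complete.cases(features), ]
train_rows
```

Longer lags and longer horizons both shrink the number of usable training rows, which is worth keeping in mind when setting ahead = 28.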

⚔️ Create a forecast trajectory over a range of dates and plot it with autoplot(). This creates a full trajectory showing how uncertainty grows with the horizon.

Hint

See this example on creating models for multiple ahead arguments: cmu-delphi.github.io/epipredict/articles/epipredict.html#generating-multiple-aheads

# We loop over horizons to create a trajectory (0 to 28 days)
all_canned_results <- lapply(0:28, \(days_ahead) {
  arx_forecaster(
    training_set,
    outcome = "death_rate",
    predictors = c("case_rate", "death_rate"),
    trainer = quantile_reg(),
    args_list = arx_args_list(
      lags = list(c(0, 1, 2, 3, 7, 14), c(0, 7, 14)),
      ahead = days_ahead
    )
  )
})

# Extract and combine predictions from all horizons
results <- bind_rows(lapply(all_canned_results, \(x) x$predictions))

autoplot(
  object = all_canned_results[[1]]$epi_workflow,
  predictions = results,
  observed_response = covid_case_death_rates |>
    filter(time_value > "2021-07-01"),
  .facet_filter = geo_value %in% target_states
)

3.3 Use Delphi data in other model types

⭐ We will use arx_classifier() to predict disease growth categories. Rather than predicting a numeric value, this model predicts which category the growth rate of the outcome falls into.

Create a classifier to predict whether the growth rate of death_rate is <= 0.01 or > 0.01 using case_rate and death_rate as predictors.

Hint

To configure the classifier, pass arx_class_args_list() to the args_list argument. The key parameter is breaks, which defines the thresholds between categories.

# Categorical prediction: is the growth rate of death_rate low (<= 0.01) or high (> 0.01)?

classifier_fcst <- arx_classifier(
  training_set |> filter(geo_value %in% c(tolower(state.abb), "dc")),
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_class_args_list(
    breaks = 0.01,
    ahead = 7
  )
)

classifier_fcst

# View the predicted classes
classifier_fcst$predictions |>
  head()

# A tibble: 6 × 4
  geo_value .pred_class forecast_date target_date
  <chr>     <fct>       <date>        <date>     
1 ak        (0.01, Inf] 2021-07-31    2021-08-07 
2 al        (0.01, Inf] 2021-07-31    2021-08-07 
3 ar        (0.01, Inf] 2021-07-31    2021-08-07 
4 az        (0.01, Inf] 2021-07-31    2021-08-07 
5 ca        (0.01, Inf] 2021-07-31    2021-08-07 
6 co        (-Inf,0.01] 2021-07-31    2021-08-07
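
The two classes in the output above come from splitting a real-valued growth rate at the single break 0.01. The base-R sketch below illustrates that binning step on a toy series. The death rates are hypothetical, the day-over-day relative change is a simplified stand-in for the growth-rate calculation the package performs, and cut() is used here only to illustrate the idea of thresholding.

```r
# Toy death rates for one location (hypothetical values).
death_rate <- c(0.10, 0.10, 0.11, 0.13, 0.12, 0.12, 0.09)

# Simplified growth rate: relative change from one day to the next.
# (An assumption for illustration; the package's calculation is more elaborate.)
growth <- diff(death_rate) / head(death_rate, -1)

# A single break at 0.01 splits the real line into two classes:
# growth <= 0.01 (flat or falling) and growth > 0.01 (rising).
growth_class <- cut(growth, breaks = c(-Inf, 0.01, Inf))

table(growth_class)
```

Supplying more than one value in breaks would produce more than two classes, for example separating falling, flat, and rising trends.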