
Accurately backtesting forecasters

Backtesting is a crucial step in the development of forecasting models: testing a model on historical data reveals how well it generalizes to new data and helps identify potential issues. In the context of epidemiological forecasting, accurate backtesting must account for the fact that the data available at the time of the forecast would have differed from the data available now. New data is constantly being collected, and past values are revised, which can affect the accuracy of a forecast.

For this reason, it is important to use version-aware forecasting, where the model is trained on data that would have been available at the time of the forecast. This ensures that the model is tested on data that is as close as possible to what would have been available in real-time; training and making predictions on finalized data can lead to an overly optimistic sense of accuracy (see, for example, McDonald et al. (2021) and the references therein).
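As a minimal sketch of the distinction, assuming an `epi_archive` object named `archive` (a placeholder name), version-aware training data can be retrieved with `epix_as_of()`:

```r
library(epiprocess)

# Version-aware: the data as it actually looked on a past forecast date
train_data <- archive %>% epix_as_of(as.Date("2021-06-01"))

# Version-unaware: the finalized data, including revisions that were not
# yet available on that forecast date
finalized_data <- archive %>% epix_as_of(archive$versions_end)
```

Training on `train_data` mimics what a real-time forecaster would have seen; training on `finalized_data` does not.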

In the epiprocess package, epix_slide() provides a convenient way to perform version-aware forecasting by using only the data as it would have been available at the forecast reference time. In vignette("epi_archive", package = "epiprocess"), we introduced the concept of an epi_archive and demonstrated how to use epix_slide() to forecast the future using a simple quantile regression model. In this vignette, we will demonstrate how to use epix_slide() to backtest an auto-regressive forecaster on historical COVID-19 case data from the US and Canada. Instead of building a forecaster from scratch as we did in the previous vignette, we will use the arx_forecaster() function from the epipredict package.

Getting case data from US states into an epi_archive

First, we download the version history (i.e., archive) of the percentage of doctor’s visits with CLI (COVID-like illness) computed from medical insurance claims, along with the number of new confirmed COVID-19 cases per 100,000 population (daily), for four states from the COVIDcast API (as used in the epiprocess vignette mentioned above).

# Select the `percent_cli` column from the data archive
doctor_visits <- archive_cases_dv_subset$DT %>%
  select(geo_value, time_value, version, percent_cli) %>%
  tidyr::drop_na(percent_cli) %>%
  as_epi_archive(compactify = TRUE)

The data can also be fetched from the Delphi Epidata API with the following query:

library(epidatr)
doctor_visits <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = epirange(20200601, 20211201),
  issues = epirange(20200601, 20211201)
) %>%
  rename(version = issue, percent_cli = value) %>%
  as_epi_archive(compactify = TRUE)

Backtesting a simple autoregressive forecaster

One of the most common use cases of the epiprocess::epi_archive() object is accurate model backtesting.

In this section we will:

  • develop a simple autoregressive forecaster that predicts the next value of the signal based on its current and past values,
  • demonstrate how to slide this forecaster over the epi_archive object to produce forecasts at a few dates, using version-unaware and version-aware computations, and
  • compare the two approaches.

To start, let’s use a simple autoregressive forecaster to predict the future percentage of doctor’s visits with COVID-like illness (percent_cli). We choose this target because of the dataset’s pattern of substantial revisions; forecasting doctor’s visits is otherwise an unusual target. While some AR models output single point forecasts, we will use quantile regression to produce a point prediction along with a 90% uncertainty band, represented by predictive quantiles at the 5% and 95% levels (the lower and upper endpoints of the band).

The arx_forecaster() function wraps the autoregressive forecaster we need and comes with sensible defaults:

  • we specify the predicted outcome to be the percentage of doctor’s visits with CLI (percent_cli),
  • we use a linear regression model as the engine,
  • the autoregressive features assume lags of 0, 7, and 14 days,
  • we forecast 7 days ahead.

All these default settings and more can be seen by calling arx_args_list():

arx_args_list()
#>  lags : 0, 7, and 14
#>  ahead : 7
#>  n_training : Inf
#>  quantile_levels : 0.05 and 0.95
#>  forecast_date : "NULL"
#>  target_date : "NULL"
#>  adjust_latency : "none"
#>  warn_latency : TRUE
#>  symmetrize : TRUE
#>  nonneg : TRUE
#>  max_lags : 14
#>  quantile_by_key : "_empty_"
#>  check_enough_data_n : "NULL"
#>  check_enough_data_epi_keys : "NULL"

These can be modified as needed by passing your desired arguments via arx_forecaster(args_list = arx_args_list()). For now we will use the defaults.
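For example, a sketch of overriding two of these defaults, using the ahead and quantile_levels arguments shown in the printout above (here epi_data is a placeholder for an epi_df snapshot):

```r
# Forecast 14 days ahead with an 80% uncertainty band
arx_forecaster(
  epi_data, # an epi_df snapshot (placeholder name)
  outcome = "percent_cli",
  predictors = "percent_cli",
  args_list = arx_args_list(ahead = 14, quantile_levels = c(0.1, 0.9))
)
```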

Note: We will use a geo-pooled approach, where we train the model on data from all states and territories combined. The data is quite similar across states, and pooling it can improve the accuracy of the forecasts while reducing the model’s susceptibility to noise. In the interest of computational speed, we only use the four-state dataset here, but the full archive can be used in the same way and has performed well in the past. Implementation-wise, geo-pooling is achieved by not using group_by(geo_value) prior to epix_slide(). In other cases, grouping may be preferable, so we leave the decision to the user, but flag this modeling choice here.
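A sketch of that implementation choice, with `forecaster` and `forecast_dates` as placeholders for a forecasting function and a vector of versions:

```r
# Geo-pooled: one model fit on data from all locations together
archive %>%
  epix_slide(~ forecaster(.x), .versions = forecast_dates)

# Per-geo: a separate model fit for each location
archive %>%
  group_by(geo_value) %>%
  epix_slide(~ forecaster(.x), .versions = forecast_dates)
```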

Let’s generate a forecast from a snapshot of the archive taken 14 days before its last version, and then compare the predictions against the latest available data.

# Let's forecast 14 days prior to the last date in the archive, to compare.
forecast_date <- doctor_visits$versions_end - 14
# The .versions argument selects this single version from the archive and
# produces a forecast only on that date.
forecasts <- doctor_visits %>%
  epix_slide(
    ~ arx_forecaster(
      .x,
      outcome = "percent_cli",
      predictors = "percent_cli",
      args_list = arx_args_list()
    )$predictions %>%
      pivot_quantiles_wider(.pred_distn),
    .versions = forecast_date
  )
# Join the forecasts with the latest data at the time of the forecast to
# compare. Since `percent_cli` data has a few days of lag, we use `tidyr::fill` to
# fill the missing values with the last observed value.
forecasts %>%
  inner_join(
    doctor_visits %>%
      epix_as_of(doctor_visits$versions_end) %>%
      group_by(geo_value) %>%
      tidyr::fill(percent_cli),
    by = c("geo_value", "target_date" = "time_value")
  ) %>%
  select(geo_value, forecast_date, .pred, `0.05`, `0.95`, percent_cli)
#> # A tibble: 4 × 6
#>   geo_value forecast_date .pred `0.05` `0.95` percent_cli
#>   <chr>     <date>        <dbl>  <dbl>  <dbl>       <dbl>
#> 1 ca        2021-11-12     7.23  4.22   10.2         4.75
#> 2 fl        2021-11-12     1.40  0       4.41        1.57
#> 3 ny        2021-11-12     3.98  0.968   6.99        3.52
#> 4 tx        2021-11-12     2.23  0       5.24        2.01

The result now contains the point forecast (median) in .pred, along with the 0.05 and 0.95 columns holding our requested quantile forecasts (produced by pivoting the quantile distribution .pred_distn wider). The observed values fall within the prediction intervals, so our implementation passes a simple sanity check.
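We can also confirm this programmatically. Assuming the joined result above is saved as `comparison` (a hypothetical name), a quick coverage check might look like:

```r
# TRUE when every observed value lies inside its 90% prediction interval
comparison %>%
  summarize(covered = all(percent_cli >= `0.05` & percent_cli <= `0.95`))
```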

Now let’s slide this forecaster in a version-unaware way and a version-aware way. For the version-unaware way, we snapshot the latest version of the data and then make a faux archive by setting version = time_value. This simulates a dataset in which each day’s observations arrive already finalized and are never revised. For the version-aware way, we simply use the true epi_archive object.

archive_cases_dv_subset_faux <- doctor_visits %>%
  epix_as_of(doctor_visits$versions_end) %>%
  mutate(version = time_value) %>%
  as_epi_archive()

To reduce typing, we create the wrapper function forecast_k_week_ahead().

# Forecast dates and horizons
forecast_dates <- seq(from = as.Date("2020-08-01"), to = as.Date("2021-11-01"), by = "1 month")
aheads <- c(7, 14, 21, 28)

# @param epi_archive The epi_archive object to forecast from
# @param ahead The number of days ahead to forecast
# @param outcome The outcome variable to forecast
# @param predictors The predictors to use in the model
# @param forecast_dates The dates to forecast on
# @param process_data A function to process the data before forecasting
forecast_k_week_ahead <- function(
    epi_archive,
    ahead = 7,
    outcome = NULL, predictors = NULL, forecast_dates = NULL, process_data = identity) {
  if (is.null(forecast_dates)) {
    forecast_dates <- epi_archive$versions_end
  }
  if (is.null(outcome) || is.null(predictors)) {
    stop("Please specify the outcome and predictors.")
  }
  epi_archive %>%
    epix_slide(
      ~ arx_forecaster(
        process_data(.x), outcome, predictors,
        args_list = arx_args_list(ahead = ahead)
      )$predictions %>%
        pivot_quantiles_wider(.pred_distn),
      .before = 120,
      .versions = forecast_dates
    )
}
# Generate the forecasts and bind them together
forecasts <- bind_rows(
  map(aheads, ~ forecast_k_week_ahead(
    archive_cases_dv_subset_faux,
    ahead = .x,
    outcome = "percent_cli",
    predictors = "percent_cli",
    forecast_dates = forecast_dates
  ) %>% mutate(version_aware = FALSE)),
  map(aheads, ~ forecast_k_week_ahead(
    doctor_visits,
    ahead = .x,
    outcome = "percent_cli",
    predictors = "percent_cli",
    forecast_dates = forecast_dates
  ) %>% mutate(version_aware = TRUE))
)

Here, arx_forecaster() does all the heavy lifting. It creates leads of the target (respecting time stamps and locations) along with lags of the features (here, lags of the response itself), estimates a forecasting model using the specified engine, creates predictions, and constructs non-parametric confidence bands.

To see how the predictions compare, we plot them on top of the latest observed values. Note that even though we’ve fitted the model on all states, we’ll display the results for just two, California (CA) and Florida (FL), to get a sense of model performance while keeping the graphic simple.

Code for plotting
geo_choose <- "ca"
forecasts_filtered <- forecasts %>%
  filter(geo_value == geo_choose) %>%
  mutate(time_value = version)
percent_cli_data <- bind_rows(
  # Snapshotted data for the version-aware forecasts
  map(
    forecast_dates,
    ~ doctor_visits %>%
      epix_as_of(.x) %>%
      mutate(version = .x)
  ) %>%
    bind_rows() %>%
    mutate(version_aware = TRUE),
  # Latest data for the version-unaware forecasts
  doctor_visits %>%
    epix_as_of(doctor_visits$versions_end) %>%
    mutate(version_aware = FALSE)
) %>%
  filter(geo_value == geo_choose)

p1 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) +
  geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) +
  geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) +
  geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) +
  geom_line(
    data = percent_cli_data,
    aes(x = time_value, y = percent_cli, color = factor(version)),
    inherit.aes = FALSE, na.rm = TRUE
  ) +
  facet_grid(version_aware ~ geo_value, scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
  theme(legend.position = "none")
geo_choose <- "fl"
forecasts_filtered <- forecasts %>%
  filter(geo_value == geo_choose) %>%
  mutate(time_value = version)
percent_cli_data <- bind_rows(
  # Snapshotted data for the version-aware forecasts
  map(
    forecast_dates,
    ~ doctor_visits %>%
      epix_as_of(.x) %>%
      mutate(version = .x)
  ) %>%
    bind_rows() %>%
    mutate(version_aware = TRUE),
  # Latest data for the version-unaware forecasts
  doctor_visits %>%
    epix_as_of(doctor_visits$versions_end) %>%
    mutate(version_aware = FALSE)
) %>%
  filter(geo_value == geo_choose)

p2 <- ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) +
  geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) +
  geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) +
  geom_vline(data = percent_cli_data, aes(color = factor(version), xintercept = version), lty = 2) +
  geom_line(
    data = percent_cli_data,
    aes(x = time_value, y = percent_cli, color = factor(version)),
    inherit.aes = FALSE, na.rm = TRUE
  ) +
  facet_grid(version_aware ~ geo_value, scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(x = "Date", y = "smoothed, day of week adjusted covid-like doctors visits") +
  theme(legend.position = "none")
#> Warning: Removed 544 rows containing missing values or values outside the scale range
#> (`geom_vline()`).

#> Warning: Removed 544 rows containing missing values or values outside the scale range
#> (`geom_vline()`).

For the two states of interest, neither approach produces amazingly accurate forecasts. However, these plots illustrate the extent to which using versioned data can affect backtesting, scoring, and therefore model selection for production.

Example using case data from Canada

Data and forecasts. Similar to the above.

By leveraging the flexibility of epiprocess, we can apply the same techniques to data from other sources. Since some collaborators are in British Columbia, Canada, we’ll do essentially the same thing for Canada as we did above.

The COVID-19 Canada Open Data Working Group collects daily time series data on COVID-19 cases, deaths, recoveries, testing and vaccinations at the health region and province levels. Data are collected from publicly available sources such as government datasets and news releases. Unfortunately, there is no simple versioned source, so we have created our own from the GitHub commit history.

First, we load versioned case rates at the provincial level. After converting these to 7-day averages (to smooth out highly variable provincial reporting), we convert the data to an epi_archive object and extract the latest version from it. Finally, we run the same forecasting exercise as for the American data, again comparing version-aware and version-unaware forecasts.

aheads <- c(7, 14, 21, 28)
canada_archive <- epidatasets::can_prov_cases
canada_archive_faux <- epix_as_of(canada_archive, canada_archive$versions_end) %>%
  mutate(version = time_value) %>%
  as_epi_archive()
# This function will add the 7-day average of the case rate to the data
# before forecasting.
smooth_cases <- function(epi_df) {
  epi_df %>%
    group_by(geo_value) %>%
    epi_slide_mean("case_rate", .window_size = 7, na.rm = TRUE) %>%
    rename(cr_7dav = slide_value_case_rate)
}
forecast_dates <- seq.Date(
  from = min(canada_archive$DT$version),
  to = max(canada_archive$DT$version),
  by = "1 month"
)

# Generate the forecasts, and bind them together
canada_forecasts <- bind_rows(
  map(
    aheads,
    ~ forecast_k_week_ahead(
      canada_archive_faux,
      ahead = .x,
      outcome = "cr_7dav",
      predictors = "cr_7dav",
      forecast_dates = forecast_dates,
      process_data = smooth_cases
    ) %>% mutate(version_aware = FALSE)
  ),
  map(
    aheads,
    ~ forecast_k_week_ahead(
      canada_archive,
      ahead = .x,
      outcome = "cr_7dav",
      predictors = "cr_7dav",
      forecast_dates = forecast_dates,
      process_data = smooth_cases
    ) %>% mutate(version_aware = TRUE)
  )
)

The figure below shows the results for a single province.

geo_choose <- "Alberta"
forecasts_filtered <- canada_forecasts %>%
  filter(geo_value == geo_choose) %>%
  mutate(time_value = version)
case_rate_data <- bind_rows(
  # Snapshotted data for the version-aware forecasts
  map(
    forecast_dates,
    ~ canada_archive %>%
      epix_as_of(.x) %>%
      smooth_cases() %>%
      mutate(case_rate = cr_7dav, version = .x)
  ) %>%
    bind_rows() %>%
    mutate(version_aware = TRUE),
  # Latest data for the version-unaware forecasts
  canada_archive %>%
    epix_as_of(canada_archive$versions_end) %>%
    smooth_cases() %>%
    mutate(case_rate = cr_7dav, version_aware = FALSE)
) %>%
  filter(geo_value == geo_choose)

ggplot(data = forecasts_filtered, aes(x = target_date, group = time_value)) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = factor(time_value)), alpha = 0.4) +
  geom_line(aes(y = .pred, color = factor(time_value)), linetype = 2L) +
  geom_point(aes(y = .pred, color = factor(time_value)), size = 0.75) +
  geom_vline(data = case_rate_data, aes(color = factor(version), xintercept = version), lty = 2) +
  geom_line(
    data = case_rate_data,
    aes(x = time_value, y = case_rate, color = factor(version)),
    inherit.aes = FALSE, na.rm = TRUE
  ) +
  facet_grid(version_aware ~ geo_value, scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(x = "Date", y = "7-day average COVID-19 case rate") +
  theme(legend.position = "none")