14  Sliding version-unaware and version-aware ARX forecasters across dates

A key function from the epiprocess package is epix_slide() (refer to the following vignette for the basics of the function: “Work with archive objects and data revisions”) which allows performing version-aware computations. That is, the function only uses data that would have been available as of time t for that reference time.

In this vignette, we use epix_slide() for backtesting our arx_forecaster on historical COVID-19 case data from the US and from Canada. We first examine the results from a version-unaware forecaster, comparing two different fitting engines and then we contrast this with version-aware forecasting. The former will proceed by constructing an epi_archive that erases its version information and then use epix_slide() to forecast the future. The latter will keep the versioned data and proceed similarly by using epix_slide() to forecast the future.

14.1 Version-unaware forecasting

14.1.1 Example using CLI and case data from US states

First, we download the version history (i.e. archive) of the percentage of doctor’s visits with CLI (COVID-like illness) computed from medical insurance claims and the number of new confirmed COVID-19 cases per 100,000 population (daily) for all 50 states from the COVIDcast API. We process as before, with the modification that we use sync = "locf" in epix_merge() so that the last version of each observation can be carried forward to extrapolate unavailable versions for the less up-to-date input archive.

us_raw_history_dfs <- readRDS(url(
  "https://github.com/cmu-delphi/epipredict/raw/dev/vignettes/articles/all_states_covidcast_signals.rds"
))

us_cli_archive <- us_raw_history_dfs[[1]] %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive(compactify = TRUE)
us_cases_archive <- us_raw_history_dfs[[2]] %>%
  select(geo_value, time_value, version = issue, case_rate = value) %>%
  as_epi_archive(compactify = TRUE)

us_archive <- epix_merge(
  us_cli_archive, us_cases_archive,
  sync = "locf", compactify = TRUE
)

We then get latest snapshot of the data from the archive by using epix_as_of(). We then create fake version information by setting version = time_value. This creates an archive that pretends to have the latest data available (since at version time x it has all the data up to time_value x, which in reality is unrealistic because the time values of the data received at version time x often lags by a few days, not to mention the later corrections that are amended to the data).

# Get latest snapshot of data and pretend it's an archive
us_latest <- us_archive %>%
  epix_as_of(version = max(.$versions_end)) %>%
  mutate(version = time_value) %>%
  as_epi_archive()
fc_time_values <- seq(
  from = as.Date("2020-08-01"),
  to = as.Date("2021-11-01"),
  by = "1 month"
)
aheads <- c(7, 14, 21, 28)

forecast_k_week_ahead <- function(epi_archive, outcome, predictors, ahead = 7, engine) {
  epi_archive %>%
    epix_slide(
      .f = function(x, gk, rtv) {
        arx_forecaster(
          x, outcome, predictors, engine,
          args_list = arx_args_list(ahead = ahead)
        )$predictions %>%
          mutate(engine_type = engine$engine) %>%
          pivot_quantiles_wider(.pred_distn)
      },
      .before = 120,
      .versions = fc_time_values
    )
}

# Generate the forecasts and bind them together
forecasts_version_unaware <- bind_rows(
  map(aheads, ~ forecast_k_week_ahead(
    us_latest,
    outcome = "case_rate",
    predictors = c("case_rate", "percent_cli"),
    ahead = .x,
    engine = linear_reg()
  )),
  map(aheads, ~ forecast_k_week_ahead(
    us_latest,
    outcome = "case_rate",
    predictors = c("case_rate", "percent_cli"),
    ahead = .x,
    engine = rand_forest(mode = "regression")
  ))
)

Here, arx_forecaster() does all the heavy lifting. It creates leads of the target (respecting time stamps and locations) along with lags of the features (here, the response and doctors visits), estimates a forecasting model using the specified engine, creates predictions, and non-parametric confidence bands.

To see how the predictions compare, we plot them on top of the latest case rates. Note that even though we’ve fitted the model on all states, we’ll just display the results for two states, California (CA) and Florida (FL), to get a sense of the model performance while keeping the graphic simple.

Code
forecasts_filtered <- forecasts_version_unaware %>%
  tibble() %>%
  filter(geo_value %in% c("ca", "fl"))
latest_data_filtered <- us_latest$DT %>%
  tibble() %>%
  filter(geo_value %in% c("ca", "fl"))

ggplot(forecasts_filtered, aes(x = target_date, group = forecast_date, fill = engine_type)) +
  geom_line(
    data = latest_data_filtered, aes(x = time_value, y = case_rate),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`), alpha = 0.4) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_grid(engine_type ~ geo_value, scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(x = "Date", y = "Reported COVID-19 case rates") +
  theme(legend.position = "none")

For the two states of interest, simple linear regression clearly performs better than random forest in terms of accuracy of the predictions and does not result in such in overconfident predictions (overly narrow confidence bands). Though, in general, neither approach produces amazingly accurate forecasts. This could be because the behaviour is rather different across states and the effects of other notable factors such as age and public health measures may be important to account for in such forecasting. Including such factors as well as making enhancements such as correcting for outliers are some improvements one could make to this simple model.1

14.1.2 Example using case data from Canada

By leveraging the flexibility of epiprocess, we can apply the same techniques to data from other sources. Since some collaborators are in British Columbia, Canada, we’ll do essentially the same thing for Canada as we did above.

The COVID-19 Canada Open Data Working Group collects daily time series data on COVID-19 cases, deaths, recoveries, testing and vaccinations at the health region and province levels. Data are collected from publicly available sources such as government datasets and news releases. Unfortunately, there is no simple versioned source, so we have created our own from the Github commit history.

First, we load versioned case rates at the provincial level. After converting these to 7-day averages (due to highly variable provincial reporting mismatches), we then convert the data to an epi_archive object, and extract the latest version from it. Finally, we run the same forcasting exercise as for the American data, but here we compare the forecasts produced from using simple linear regression with those from using boosted regression trees.

# source("drafts/canada-case-rates.R)
can <- epidatasets::can_prov_cases
can <- can %>%
  group_by(version, geo_value) %>%
  arrange(time_value) %>%
  mutate(cr_7dav = RcppRoll::roll_meanr(case_rate, n = 7L)) %>%
  as_epi_archive(compactify = TRUE)

can_latest <- epix_as_of(can, max_version = max(can$DT$version)) %>%
  mutate(version = time_value) %>%
  as_epi_archive()

# Generate the forecasts, and bind them together
can_fc <- bind_rows(
  map(
    aheads,
    ~ forecast_k_week_ahead(can_latest, "cr_7dav", "cr_7dav", .x, linear_reg())
  ),
  map(
    aheads,
    ~ forecast_k_week_ahead(can_latest, "cr_7dav", "cr_7dav", .x, boost_tree(mode = "regression", trees = 20))
  )
)

The first figure shows the results for all of the provinces using linear regression.

Code
ggplot(
  can_fc %>% filter(engine_type == "lm"),
  aes(x = target_date, group = forecast_date)
) +
  coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
  geom_line(
    data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
    alpha = 0.4
  ) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(
    title = "Using simple linear regression", x = "Date",
    y = "Reported COVID-19 case rates"
  ) +
  theme(legend.position = "none")

Compare those forecasts with a related set using Gradient Boosting.

Code
ggplot(
  can_fc %>% filter(engine_type == "xgboost"),
  aes(x = target_date, group = forecast_date)
) +
  coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
  geom_line(
    data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
    alpha = 0.4
  ) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(
    title = "Using boosted regression trees", x = "Date",
    y = "Reported COVID-19 case rates"
  ) +
  theme(legend.position = "none")

Both approaches tend to produce quite volatile forecasts (point predictions) and/or are overly confident (very narrow bands), particularly when boosted regression trees are used. But as this is meant to be a simple demonstration of sliding with different engines in arx_forecaster, we may devote another vignette to work on improving the predictive modelling using the suite of tools available in epipredict.

14.2 Version-aware forecasting

14.2.1 Example using case data from US states

We will now run pseudoprospective forecasts based on properly-versioned data (that would have been available in real-time) to forecast future COVID-19 case rates from current and past COVID-19 case rates for all states. All we have to do is use the historical archive of the data with version information, us_archive, instead of us_latest like we did above, in the argument to our forecaster wrapper forecast_k_week_ahead(). Below we do that computation, tag it, and combine it with the forecasts from one of the engines made above.

forecasts_version_aware <- map(aheads, ~ forecast_k_week_ahead(
  us_archive,
  outcome = "case_rate",
  predictors = c("case_rate", "percent_cli"),
  ahead = .x,
  engine = linear_reg()
)) %>%
  bind_rows() %>%
  mutate(version = "version faithful")

Now we can plot the results on top of the latest case rates. As before, we will only display and focus on the results for FL and CA for simplicity.

Code
forecasts_filtered <- bind_rows(
  forecasts_version_aware,
  forecasts_version_unaware %>%
    filter(engine_type == "lm") %>%
    mutate(version = "version unfaithful")
) %>%
  tibble() %>%
  filter(geo_value %in% c("ca", "fl"))
latest_data_filtered <- us_latest$DT %>%
  tibble() %>%
  select(-version) %>%
  filter(geo_value %in% c("ca", "fl"))

ggplot(forecasts_filtered, aes(x = target_date, group = forecast_date, fill = version)) +
  geom_line(
    data = latest_data_filtered, aes(x = time_value, y = case_rate),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`), alpha = 0.4) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_grid(version ~ geo_value, scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(c(0, 0.05))) +
  labs(x = "Date", y = "Reported COVID-19 case rates") +
  theme(legend.position = "none")

Again, we observe that the results are not great for these two states, but that’s likely due to the simplicity of the model (ex. the omission of key factors such as age and public health measures) and the quality of the data (ex. we have not personally corrected for anomalies in the data).

We shall leave it to the reader to try the above version aware and unaware forecasting exercise on the Canadian case rate data. The above code for the American state data should be readily adaptable for this purpose.


  1. Note that, despite the above caveats, simple models like this tend to out-perform many far more complicated models in the online Covid forecasting due to those models high variance predictions.↩︎