```
library(epipredict)
library(epidatr)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(magrittr)
library(purrr)
```

## Demonstrations of sliding AR and ARX forecasters

A key function from the epiprocess package is `epix_slide()` (refer to the vignette "Work with archive objects and data revisions" for the basics of the function), which allows performing version-aware computations. That is, for each reference time t, the function only uses data that would have been available as of time t.
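
To build intuition for "as of", here is a minimal sketch in base R (an illustration only, not the epiprocess implementation): each row carries the version (issue date) on which it became available, and a version-aware computation at reference time t keeps, for each `time_value`, only the latest row whose version is at or before t.

```r
archive <- data.frame(
  time_value = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  version    = as.Date(c("2021-01-02", "2021-01-05", "2021-01-03")),
  value      = c(10, 12, 7)  # the Jan 1 value was revised upward on Jan 5
)

as_of <- function(archive, t) {
  avail <- archive[archive$version <= t, ]                   # drop not-yet-published rows
  avail <- avail[order(avail$time_value, avail$version), ]   # latest version sorts last
  avail[!duplicated(avail$time_value, fromLast = TRUE), ]    # keep latest per time_value
}

as_of(archive, as.Date("2021-01-03"))$value # 10 7  (revision not yet visible)
as_of(archive, as.Date("2021-01-05"))$value # 12 7  (revision visible)
```

A forecaster backtested this way sees exactly what it would have seen in real time, revisions and all.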

In this vignette, we use `epix_slide()` for backtesting our `arx_forecaster` on historical COVID-19 case data from the US and from Canada. We first examine the results from a version-unaware forecaster, comparing two different fitting engines, and then contrast this with version-aware forecasting. The former proceeds by constructing an `epi_archive` that erases its version information and then using `epix_slide()` to forecast the future. The latter keeps the versioned data and proceeds similarly, using `epix_slide()` to forecast the future.

### Comparing different forecasting engines

#### Example using CLI and case data from US states

First, we download the version history (i.e., archive) of the percentage of doctor's visits with CLI (COVID-like illness) computed from medical insurance claims, and of the number of new confirmed COVID-19 cases per 100,000 population (daily), for all 50 states from the COVIDcast API.

## Load a data archive

We proceed as before, with the modification that we use `sync = "locf"` in `epix_merge()`, so that the last version of each observation is carried forward to extrapolate unavailable versions for the less up-to-date input archive.
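
The "last observation carried forward" idea behind this sync rule can be sketched in base R (an illustration of the rule, not `epix_merge()` itself): where the staler input archive has no value at a version, its most recent known value is reused.

```r
# One signal stops publishing after the second version; locf fills the gap.
versions <- as.Date("2021-01-01") + 0:4
cli <- c(1.2, 1.3, NA, NA, NA)  # stale signal: no versions after Jan 2

locf <- function(x) {
  # carry the last non-NA entry forward
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}
locf(cli) # 1.2 1.3 1.3 1.3 1.3
```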

```
theme_set(theme_bw())

y <- readRDS("all_states_covidcast_signals.rds") %>%
  purrr::map(~ select(.x, geo_value, time_value, version = issue, value))

x <- epix_merge(
  y[[1]] %>% rename(percent_cli = value) %>% as_epi_archive(compactify = FALSE),
  y[[2]] %>% rename(case_rate = value) %>% as_epi_archive(compactify = FALSE),
  sync = "locf",
  compactify = TRUE
)
rm(y)
```

We then obtain the latest snapshot of the data and fake the version information by setting `version = time_value`. This has the effect of treating the data as if it arrives exactly on the day of the `time_value`.
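
A toy illustration of what this overwriting does: once `version = time_value`, the data available "as of" any time t is simply all rows with `time_value <= t`, as if every observation arrived the day it occurred and was never revised.

```r
snapshot <- data.frame(
  time_value = as.Date("2021-01-01") + 0:2,
  value = c(10, 7, 9)
)
snapshot$version <- snapshot$time_value  # fake: publication day == observation day

available_as_of <- function(d, t) d[d$version <= t, ]
available_as_of(snapshot, as.Date("2021-01-02"))$value # 10 7
```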

```
# Latest snapshot of data, and forecast dates
x_latest <- epix_as_of(x, version = max(x$versions_end)) %>%
  mutate(version = time_value) %>%
  as_epi_archive()
fc_time_values <- seq(
  from = as.Date("2020-08-01"),
  to = as.Date("2021-11-01"),
  by = "1 month"
)
aheads <- c(7, 14, 21, 28)

forecast_k_week_ahead <- function(epi_archive, outcome, predictors, ahead = 7, engine) {
  epi_archive %>%
    epix_slide(
      .f = function(x, gk, rtv) {
        arx_forecaster(
          x, outcome, predictors, engine,
          args_list = arx_args_list(ahead = ahead)
        )$predictions %>%
          mutate(engine_type = engine$engine) %>%
          pivot_quantiles_wider(.pred_distn)
      },
      .before = 120,
      .versions = fc_time_values
    )
}
```

```
# Generate the forecasts and bind them together
fc <- bind_rows(
  map(aheads, ~ forecast_k_week_ahead(
    x_latest,
    outcome = "case_rate",
    predictors = c("case_rate", "percent_cli"),
    ahead = .x,
    engine = linear_reg()
  )),
  map(aheads, ~ forecast_k_week_ahead(
    x_latest,
    outcome = "case_rate",
    predictors = c("case_rate", "percent_cli"),
    ahead = .x,
    engine = rand_forest(mode = "regression")
  ))
)
```

Here, `arx_forecaster()` does all the heavy lifting. It creates leads of the target (respecting time stamps and locations) along with lags of the features (here, the response and doctor's visits), estimates a forecasting model using the specified engine, creates predictions, and provides non-parametric confidence bands.
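
A hypothetical sketch (for intuition only) of the kind of design matrix this lead/lag construction produces: lagged copies of each predictor and a led copy of the outcome, aligned by time within a single location. The real function additionally respects `geo_value` groupings and handles incomplete rows.

```r
lag_vec  <- function(v, k) c(rep(NA, k), head(v, length(v) - k))  # x_{t-k}
lead_vec <- function(v, a) c(tail(v, length(v) - a), rep(NA, a))  # y_{t+a}

y <- c(5, 6, 8, 11, 15, 20)  # toy case rates
train <- data.frame(
  lag0    = lag_vec(y, 0),   # y_t
  lag1    = lag_vec(y, 1),   # y_{t-1}
  outcome = lead_vec(y, 2)   # y_{t+2}, the target 2 steps ahead
)
# Rows containing NA (the first lag rows and the last lead rows) would be
# dropped before fitting.
train$outcome # 8 11 15 20 NA NA
```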

To see how the predictions compare, we plot them on top of the latest case rates. Note that even though we’ve fitted the model on all states, we’ll just display the results for two states, California (CA) and Florida (FL), to get a sense of the model performance while keeping the graphic simple.

## Code for plotting

```
fc_cafl <- fc %>%
  tibble() %>%
  filter(geo_value %in% c("ca", "fl"))
x_latest_cafl <- x_latest$DT %>%
  tibble() %>%
  filter(geo_value %in% c("ca", "fl"))

p1 <- ggplot(fc_cafl, aes(target_date, group = forecast_date, fill = engine_type)) +
  geom_line(
    data = x_latest_cafl, aes(x = time_value, y = case_rate),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`), alpha = 0.4) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_grid(vars(geo_value), vars(engine_type), scales = "free") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Date", y = "Reported COVID-19 case rates") +
  theme(legend.position = "none")
```

For the two states of interest, simple linear regression clearly performs better than random forest in terms of prediction accuracy, and it does not produce such overconfident predictions (overly narrow confidence bands). That said, neither approach produces particularly accurate forecasts. This could be because behaviour differs considerably across states, and the effects of other notable factors, such as age and public health measures, may be important to account for in such forecasting. Including such factors, as well as making enhancements such as correcting for outliers, are some improvements one could make to this simple model.^{1}

#### Example using case data from Canada

## Data and forecasts. Similar to the above.

By leveraging the flexibility of `epiprocess`, we can apply the same techniques to data from other sources. Since some collaborators are in British Columbia, Canada, we'll do essentially the same thing for Canada as we did above.

The COVID-19 Canada Open Data Working Group collects daily time series data on COVID-19 cases, deaths, recoveries, testing and vaccinations at the health region and province levels. Data are collected from publicly available sources such as government datasets and news releases. Unfortunately, there is no simple versioned source, so we have created our own from the Github commit history.

First, we load versioned case rates at the provincial level. After converting these to 7-day averages (to smooth out highly variable provincial reporting), we convert the data to an `epi_archive` object and extract the latest version from it. Finally, we run the same forecasting exercise as for the American data, but here we compare the forecasts produced by simple linear regression with those from boosted regression trees.
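
The right-aligned 7-day mean used below (via `RcppRoll::roll_meanr()`) can be sketched in base R for intuition: each day's value is the mean of that day and the six preceding days, with `NA` until a full window is available.

```r
roll_meanr7 <- function(x, n = 7L) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- mean(x[(i - n + 1):i])  # trailing window of n days
  }
  out
}
roll_meanr7(1:8) # NA NA NA NA NA NA 4 5
```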

```
# source("drafts/canada-case-rates.R")
can <- readRDS(system.file(
  "extdata", "can_prov_cases.rds",
  package = "epipredict", mustWork = TRUE
)) %>%
  group_by(version, geo_value) %>%
  arrange(time_value) %>%
  mutate(cr_7dav = RcppRoll::roll_meanr(case_rate, n = 7L)) %>%
  as_epi_archive(compactify = TRUE)

can_latest <- epix_as_of(can, version = max(can$DT$version)) %>%
  mutate(version = time_value) %>%
  as_epi_archive()

# Generate the forecasts, and bind them together
can_fc <- bind_rows(
  map(
    aheads,
    ~ forecast_k_week_ahead(can_latest, "cr_7dav", "cr_7dav", .x, linear_reg())
  ),
  map(
    aheads,
    ~ forecast_k_week_ahead(
      can_latest, "cr_7dav", "cr_7dav", .x,
      boost_tree(mode = "regression", trees = 20)
    )
  )
)
```

The figures below show the results for all of the provinces.

```
ggplot(
  can_fc %>% filter(engine_type == "lm"),
  aes(x = target_date, group = forecast_date)
) +
  coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
  geom_line(
    data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
    alpha = 0.4
  ) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  labs(
    title = "Using simple linear regression", x = "Date",
    y = "Reported COVID-19 case rates"
  ) +
  theme(legend.position = "none")
```

```
ggplot(
  can_fc %>% filter(engine_type == "xgboost"),
  aes(x = target_date, group = forecast_date)
) +
  coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
  geom_line(
    data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
    alpha = 0.4
  ) +
  geom_line(aes(y = .pred)) +
  geom_point(aes(y = .pred), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  labs(
    title = "Using boosted regression trees", x = "Date",
    y = "Reported COVID-19 case rates"
  ) +
  theme(legend.position = "none")
```

Both approaches tend to produce quite volatile forecasts (point predictions) and/or are overly confident (very narrow bands), particularly when boosted regression trees are used. But as this is meant to be a simple demonstration of sliding with different engines in `arx_forecaster`, we leave improving the predictive modelling with the suite of tools available in epipredict to a future vignette.

### Version-aware forecasting

#### Example using case data from US states

We will now employ a forecaster that uses properly-versioned data (that would have been available in real time) to forecast the 7-day average of future COVID-19 case rates from current and past COVID-19 case rates and death rates for all states. That is, we can make forecasts on the archive, `x`, and compare those to forecasts on the latest data, `x_latest`, using the same general set-up as above. Note that in this example, we use a geo-pooled approach (combining data from all US states and territories) to train our model.

## Download data using `{epidatr}`

```
# loading in the data
states <- "*"
confirmed_incidence_prop <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_incidence_prop",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(20200301, 20211231),
  geo_values = states,
  issues = epirange(20000101, 20211231)
) %>%
  select(geo_value, time_value, version = issue, case_rate = value) %>%
  arrange(geo_value, time_value) %>%
  as_epi_archive(compactify = FALSE)

deaths_incidence_prop <- pub_covidcast(
  source = "jhu-csse",
  signals = "deaths_incidence_prop",
  time_type = "day",
  geo_type = "state",
  time_values = epirange(20200301, 20211231),
  geo_values = states,
  issues = epirange(20000101, 20211231)
) %>%
  select(geo_value, time_value, version = issue, death_rate = value) %>%
  arrange(geo_value, time_value) %>%
  as_epi_archive(compactify = FALSE)

x <- epix_merge(confirmed_incidence_prop, deaths_incidence_prop, sync = "locf")

x <- x %>%
  epix_slide(
    .versions = fc_time_values,
    function(x, gk, rtv) {
      x %>%
        group_by(geo_value) %>%
        epi_slide_mean(case_rate, .window_size = 7L) %>%
        rename(case_rate_7d_av = slide_value_case_rate) %>%
        epi_slide_mean(death_rate, .window_size = 7L) %>%
        rename(death_rate_7d_av = slide_value_death_rate) %>%
        ungroup()
    }
  ) %>%
  rename(version = time_value) %>%
  rename(
    time_value = slide_value_time_value,
    geo_value = slide_value_geo_value,
    case_rate = slide_value_case_rate,
    death_rate = slide_value_death_rate,
    case_rate_7d_av = slide_value_case_rate_7d_av,
    death_rate_7d_av = slide_value_death_rate_7d_av
  ) %>%
  as_epi_archive(compactify = TRUE)

saveRDS(x$DT, file = "case_death_rate_archive.rds")
```

```
x <- readRDS("case_death_rate_archive.rds")
x <- as_epi_archive(x)
```

Here we specify the ARX model.

```
aheads <- c(7, 14, 21)
fc_time_values <- seq(
  from = as.Date("2020-09-01"),
  to = as.Date("2021-12-31"),
  by = "1 month"
)

forecaster <- function(x) {
  map(aheads, function(ahead) {
    arx_forecaster(
      epi_data = x,
      outcome = "death_rate_7d_av",
      predictors = c("death_rate_7d_av", "case_rate_7d_av"),
      trainer = quantile_reg(),
      args_list = arx_args_list(lags = c(0, 7, 14, 21), ahead = ahead)
    )$predictions
  }) %>%
    bind_rows()
}
```

We can now use the forecaster function we've created in the sliding pipeline to generate predictions. We store the predictions in the `arx_preds` variable, and obtain the most up-to-date version of the data in the epi archive, storing it as `x_latest`.

```
arx_preds <- x %>%
  epix_slide(
    ~ forecaster(.x),
    .before = 120, .versions = fc_time_values
  ) %>%
  mutate(engine_type = quantile_reg()$engine) %>%
  mutate(ahead_val = target_date - forecast_date)

x_latest <- epix_as_of(x, version = max(x$versions_end))
```

Now we plot both the actual and predicted 7-day averages of the death rate for the chosen states.

## Code for the plot

```
states_to_show <- c("ca", "ny", "mi", "az")
fc_states <- arx_preds %>%
  filter(geo_value %in% states_to_show) %>%
  pivot_quantiles_wider(.pred_distn)
x_latest_states <- x_latest %>% filter(geo_value %in% states_to_show)

p2 <- ggplot(fc_states, aes(target_date, group = forecast_date)) +
  geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value), alpha = 0.4) +
  geom_line(
    data = x_latest_states, aes(x = time_value, y = death_rate_7d_av),
    inherit.aes = FALSE, color = "gray50"
  ) +
  geom_line(aes(y = .pred, color = geo_value)) +
  geom_point(aes(y = .pred, color = geo_value), size = 0.5) +
  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 1L) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1") +
  labs(x = "Date", y = "7 day average COVID-19 death rates") +
  theme(legend.position = "none")
```