10  Introducing the flatline forecaster

The flatline forecaster is a very simple forecasting model intended for epi_df data, where the most recent observation is used as the forecast for any future date. In other words, the last observation is propagated forward. Hence, a flat line phenomenon is observed for the point predictions. The predictive intervals are produced from the quantiles of the residuals of such a forecast over all of the training data. By default, these intervals will be obtained separately for each combination of keys (geo_value and any additional keys) in the epi_df. Thus, the output is a data frame of point (and optionally interval) forecasts at a single unique horizon (ahead) for each unique combination of key variables. This forecaster is comparable to the baseline used by the COVID Forecast Hub.

10.1 Example of using the flatline forecaster

We will continue to use the case_death_rate_subset dataset that comes with the epipredict package. In brief, this is a subset of the JHU daily COVID-19 cases and deaths by state. While this dataset ranges from Dec 31, 2020 to Dec 31, 2021, we will only consider a small subset at the end of that range to keep our example relatively simple.

jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value >= as.Date("2021-09-01"))

jhu
#> An `epi_df` object, 6,832 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 12:08:25.791826
#> 
#> # A tibble: 6,832 × 4
#>   geo_value time_value case_rate death_rate
#> * <chr>     <date>         <dbl>      <dbl>
#> 1 ak        2021-09-01      75.3      0.198
#> 2 al        2021-09-01     113.       0.845
#> 3 ar        2021-09-01      68.5      0.919
#> 4 as        2021-09-01       0        0    
#> 5 az        2021-09-01      48.8      0.414
#> 6 ca        2021-09-01      38.4      0.246
#> # ℹ 6,826 more rows

10.1.1 The basic mechanics of the flatline forecaster

The simplest way to create and train a flatline forecaster to predict the d eath rate one week into the future, is to input the epi_df and the name of the column from it that we want to predict in the flatline_forecaster function.

one_week_ahead <- flatline_forecaster(jhu, outcome = "death_rate")
one_week_ahead
#> 
#> ══ A basic forecaster of type flatline ══════════════════════════════════════
#> 
#> This forecaster was fit on 2023-06-19 20:31:13
#> 
#> Training data was an `epi_df` with
#> • Geography: state,
#> • Time type: day,
#> • Using data up-to-date as of: 2022-05-31 12:08:25.
#> 
#> ── Predictions ──────────────────────────────────────────────────────────────
#> 
#> A total of 56 predictions are available for
#> • 56 unique geographic regions,
#> • At forecast dates: 2021-12-31,
#> • For target dates: 2022-01-07.

The result is both a fitted model object which could be used any time in the future to create different forecasts, as well as a set of predicted values and prediction intervals for each location 7 days after the last available time value in the data, which is Dec 31, 2021. Note that 7 days is the default number of time steps ahead of the forecast date in which forecasts should be produced. To change this, you must change the value of the ahead parameter in the list of additional arguments flatline_args_list(). Let’s change this to 5 days to get some practice.

five_days_ahead <- flatline_forecaster(
  jhu,
  outcome = "death_rate",
  flatline_args_list(ahead = 5L)
)

five_days_ahead
#> 
#> ══ A basic forecaster of type flatline ══════════════════════════════════════
#> 
#> This forecaster was fit on 2023-06-19 20:31:14
#> 
#> Training data was an `epi_df` with
#> • Geography: state,
#> • Time type: day,
#> • Using data up-to-date as of: 2022-05-31 12:08:25.
#> 
#> ── Predictions ──────────────────────────────────────────────────────────────
#> 
#> A total of 56 predictions are available for
#> • 56 unique geographic regions,
#> • At forecast dates: 2021-12-31,
#> • For target dates: 2022-01-05.

We could also specify that we want a 80% predictive interval by changing the levels. The default 0.05 and 0.95 levels/quantiles give us 90% predictive interval.

five_days_ahead <- flatline_forecaster(
  jhu,
  outcome = "death_rate",
  flatline_args_list(ahead = 5L, levels = c(0.1, 0.9))
)

five_days_ahead
#> 
#> ══ A basic forecaster of type flatline ══════════════════════════════════════
#> 
#> This forecaster was fit on 2023-06-19 20:31:14
#> 
#> Training data was an `epi_df` with
#> • Geography: state,
#> • Time type: day,
#> • Using data up-to-date as of: 2022-05-31 12:08:25.
#> 
#> ── Predictions ──────────────────────────────────────────────────────────────
#> 
#> A total of 56 predictions are available for
#> • 56 unique geographic regions,
#> • At forecast dates: 2021-12-31,
#> • For target dates: 2022-01-05.

To see the other arguments that you may modify, please see ?flatline_args_list(). For now, we will move on to looking at the workflow.

five_days_ahead$epi_workflow
#> ══ Epi Workflow [trained] ═══════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> Postprocessor: Frosting
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_epi_ahead()
#> • step_training_window()
#> 
#> ── Model ────────────────────────────────────────────────────────────────────
#> Flatline forecaster
#> 
#> Predictions produced by geo_value resulting in 56 total forecasts.
#> A total of 7112 residuals are available from the training set.
#> 
#> ── Postprocessor ────────────────────────────────────────────────────────────
#> 5 Frosting Layers
#> 
#> • layer_predict()
#> • layer_residual_quantiles()
#> • layer_add_forecast_date()
#> • layer_add_target_date()
#> • layer_threshold()

The fitted model here was based on minimal pre-processing of the data, estimating a flatline model, and then post-processing the results to be meaningful for epidemiological tasks. To look deeper into the pre-processing, model and processing parts individually, you may use the $ operator after epi_workflow. For example, let’s examine the pre-processing part in more detail.

library(workflows)
extract_preprocessor(five_days_ahead$epi_workflow)
#> 
#> ── Recipe ───────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> predictor:  3
#> geo_value:  1
#> raw:        1
#> time_value: 1
#> 
#> ── Operations
#> • Leading: death_rate by 5
#> • # of recent observations per key limited to:: Inf

Under Operations, we can see that the pre-processing operations were to lead the death rate by 5 days (step_epi_ahead()) and that the # of recent observations used in the training window were not limited (in step_training_window() as n_training = Inf in flatline_args_list()). You should also see the molded/pre-processed training data.

For symmetry, let’s have a look at the post-processing.

extract_frosting(five_days_ahead$epi_workflow)
#> 
#> ── Frosting ─────────────────────────────────────────────────────────────────
#> 
#> ── Layers
#> • Creating predictions: "<calculated>"
#> • Resampling residuals for predictive quantiles: "<calculated>" levels 0.1,
#>   0.9
#> • Adding forecast date: "2021-12-31"
#> • Adding target date: "2022-01-05"
#> • Thresholding predictions: dplyr::starts_with(".pred") to ]0, Inf)

The post-processing operations in the order the that were performed were to create the predictions and the predictive intervals, add the forecast and target dates and bound the predictions at zero.

We can also easily examine the predictions themselves.

five_days_ahead$predictions
#> # A tibble: 56 × 5
#>   geo_value  .pred       .pred_distn forecast_date target_date
#>   <chr>      <dbl>            <dist> <date>        <date>     
#> 1 ak        0.0395 [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> 2 al        0.107  [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> 3 ar        0.490  [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> 4 as        0      [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> 5 az        0.608  [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> 6 ca        0.142  [0.1, 0.9]<q-rng> 2021-12-31    2022-01-05 
#> # ℹ 50 more rows

The results above show a distributional forecast produced using data through the end of 2021 for the January 5, 2022. A prediction for the death rate per 100K inhabitants along with a 95% predictive interval is available for every state (geo_value).

The figure below displays the prediction and prediction interval for three sample states: Arizona, New York, and Florida.

Code
samp_geos <- c("az", "ny", "fl")

hist <- jhu %>%
  filter(geo_value %in% samp_geos)

preds <- five_days_ahead$predictions %>%
  filter(geo_value %in% samp_geos) %>%
  mutate(q = nested_quantiles(.pred_distn)) %>%
  unnest(q) %>%
  pivot_wider(names_from = tau, values_from = q)

ggplot(hist, aes(color = geo_value)) +
  geom_line(aes(time_value, death_rate)) +
  theme_bw() +
  geom_errorbar(data = preds, aes(x = target_date, ymin = `0.1`, ymax = `0.9`)) +
  geom_point(data = preds, aes(target_date, .pred)) +
  geom_vline(data = preds, aes(xintercept = forecast_date)) +
  scale_colour_viridis_d(name = "") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  facet_grid(geo_value ~ ., scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = "", y = "Incident deaths per 100K\n inhabitants")

The vertical black line is the forecast date. Here the forecast seems pretty reasonable based on the past observations shown. In cases where the recent past is highly predictive of the near future, a simple flatline forecast may be respectable, but in more complex situations where there is more uncertainty of what’s to come, the flatline forecaster may be best relegated to being a baseline model and nothing more.

Take for example what happens when we consider a wider range of target dates. That is, we will now predict for several different horizons or ahead values - in our case, 5 to 25 days ahead, inclusive. Since the flatline forecaster function forecasts at a single unique ahead value, we can use the map() function from purrr to apply the forecaster to each ahead value we want to use. Then, we row bind the list of results.

out_df <- map(1:28, ~ flatline_forecaster(
  epi_data = jhu,
  outcome = "death_rate",
  args_list = flatline_args_list(ahead = .x)
)$predictions) %>%
  list_rbind()

Then, we proceed as we did before. The only difference from before is that we’re using out_df where we had five_days_ahead$predictions.

Code
preds <- out_df %>%
  filter(geo_value %in% samp_geos) %>%
  mutate(q = nested_quantiles(.pred_distn)) %>%
  unnest(q) %>%
  pivot_wider(names_from = tau, values_from = q)

ggplot(hist) +
  geom_line(aes(time_value, death_rate)) +
  geom_ribbon(
    data = preds,
    aes(x = target_date, ymin = `0.05`, ymax = `0.95`, fill = geo_value)
  ) +
  geom_point(data = preds, aes(target_date, .pred, colour = geo_value)) +
  geom_vline(data = preds, aes(xintercept = forecast_date)) +
  scale_colour_viridis_d() +
  scale_fill_viridis_d(alpha = .4) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(expand = expansion(c(0, .05))) +
  facet_grid(geo_value ~ ., scales = "free_y") +
  labs(x = "", y = "Incident deaths per 100K\n inhabitants") +
  theme(legend.position = "none")

Now, you can really see the flat line trend in the predictions. And you may also observe that as we get further away from the forecast date, the more unnerving using a flatline prediction becomes. It feels increasingly unnatural.

So naturally the choice of forecaster relates to the time frame being considered. In general, using a flatline forecaster makes more sense for short-term forecasts than for long-term forecasts and for periods of great stability than in less stable times. Realistically, periods of great stability are rare. Moreover, in our model of choice we want to take into account more information about the past than just what happened at the most recent time point. So simple forecasters like the flatline forecaster don’t cut it as actual contenders in many real-life situations. However, they are not useless, just used for a different purpose. A simple model is often used to compare a more complex model to, which is why you may have seen such a model used as a baseline in the COVID Forecast Hub. The following blog post from Delphi explores the Hub’s ensemble accuracy relative to such a baseline model.

10.2 What we’ve learned in a nutshell

Though the flatline forecaster is a very basic model with limited customization, it is about as steady and predictable as a model can get. So it provides a good reference or baseline to compare more complicated models to.