
get_test_data() uses the longest lag period in the recipe to create an epi_df containing geo_value, time_value, and the other variables in the original dataset. This epi_df is then used to create the features needed to produce forecasts.


get_test_data(
  recipe,
  x,
  fill_locf = FALSE,
  n_recent = NULL,
  forecast_date = max(x$time_value)
)



A recipe object.


An epi_df. The typical usage is to pass the same data as that used for fitting the recipe.


Logical. Should missing data be filled using last observation carried forward (LOCF)?


Integer or NULL. If filling missing data with fill_locf = TRUE, how far back are we willing to tolerate missing data? Larger values allow more filling. The default NULL determines this from the recipe. For example, if n_recent = 3 and the 3 most recent observations in any geo_value are all NA, nothing can be filled and an error is thrown. (See details.)


Defaults to the maximum time_value in x. If data latency means that recent NA values should be filled, this may be later than the last available time_value.


An object of the same type as x with columns geo_value, time_value, any additional keys, as well as other variables in the original dataset.


The minimum amount of recent data required to produce a forecast is equal to the maximum lag requested (on any predictor), plus the longest horizon used when the recipe requests growth rate calculations. This is calculated internally.
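To make the window calculation concrete, here is a minimal base-R sketch of the arithmetic described above. It is not the package's internal code: the `lags` list, `growth_horizon`, and the `toy` data frame are hypothetical stand-ins for what a recipe and an epi_df would supply.

```r
# Hypothetical lags per predictor, mirroring a recipe with
# step_epi_lag(death_rate, lag = c(0, 7, 14)) and
# step_epi_lag(case_rate, lag = c(0, 7, 14)).
lags <- list(death_rate = c(0, 7, 14), case_rate = c(0, 7, 14))

# Assumed longest growth-rate horizon; 0 because no growth-rate step is used.
growth_horizon <- 0

# Number of days to look back from the most recent date.
keep <- max(unlist(lags)) + growth_horizon

# Toy data: 2 locations with 30 daily observations each.
dates <- seq(as.Date("2021-12-02"), as.Date("2021-12-31"), by = "day")
toy <- data.frame(
  geo_value = rep(c("ak", "al"), each = length(dates)),
  time_value = rep(dates, times = 2)
)

# Keep only rows within `keep` days of the latest time_value.
recent <- toy[toy$time_value >= max(toy$time_value) - keep, ]
nrow(recent)             # 15 days x 2 locations
min(recent$time_value)
```

With a maximum lag of 14 days, the window covers 15 dates per location (the latest date plus the 14 days before it).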

It also optionally fills missing values using the last-observation-carried-forward (LOCF) method. If this is not possible (for example, because some location would have only NA values), it produces an error suggesting alternative options for handling missing values with more advanced techniques.
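The LOCF behaviour can be illustrated with a short base-R sketch. The helpers `locf()` and `fill_recent()` are hypothetical names written for this example, not functions from the package, and the error check is only an approximation of the rule described for n_recent.

```r
# Hypothetical helper: carry the last non-missing value forward.
locf <- function(v) {
  idx <- cumsum(!is.na(v))          # count of non-NA values seen so far
  c(NA, v[!is.na(v)])[idx + 1]      # leading NAs stay NA
}

# Hypothetical helper: fill one column per geo_value, and stop if the
# `n_recent` most recent values in any location are all NA.
fill_recent <- function(df, value_col, n_recent) {
  pieces <- lapply(split(df, df$geo_value), function(g) {
    g <- g[order(g$time_value), ]
    if (all(is.na(tail(g[[value_col]], n_recent)))) {
      stop("The ", n_recent, " most recent values are all NA for geo_value '",
           g$geo_value[1], "'; LOCF cannot fill them.")
    }
    g[[value_col]] <- locf(g[[value_col]])
    g
  })
  do.call(rbind, pieces)
}

# Toy data: "ak" is missing its two most recent values.
toy <- data.frame(
  geo_value = rep(c("ak", "al"), each = 4),
  time_value = rep(as.Date("2021-12-28") + 0:3, times = 2),
  death_rate = c(1.0, 1.2, NA, NA, 0.3, NA, 0.4, 0.5)
)

filled <- fill_recent(toy, "death_rate", n_recent = 3)
filled$death_rate
```

In this sketch, calling `fill_recent(toy, "death_rate", n_recent = 2)` stops with an error instead, because the two most recent values for "ak" are both NA.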


# create recipe
rec <- epi_recipe(case_death_rate_subset) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14))
get_test_data(recipe = rec, x = case_death_rate_subset)
#> An `epi_df` object, 840 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 19:08:25.791826
#> # A tibble: 840 × 4
#>    geo_value time_value case_rate death_rate
#>  * <chr>     <date>         <dbl>      <dbl>
#>  1 ak        2021-12-17      23.1      1.19 
#>  2 al        2021-12-17      15.6      0.290
#>  3 ar        2021-12-17      23.4      0.467
#>  4 as        2021-12-17       0        0    
#>  5 az        2021-12-17      41.2      1.04 
#>  6 ca        2021-12-17      16.9      0.158
#>  7 co        2021-12-17      30.5      0.578
#>  8 ct        2021-12-17      64.8      0.120
#>  9 dc        2021-12-17      50.4      0.140
#> 10 de        2021-12-17      67.9      0.333
#> # ℹ 830 more rows