Based on the longest lag period in the recipe, get_test_data() creates an epi_df with columns geo_value, time_value, and other variables in the original dataset, which will be used to create features necessary to produce forecasts.
Usage
get_test_data(
  recipe,
  x,
  fill_locf = FALSE,
  n_recent = NULL,
  forecast_date = max(x$time_value)
)
Arguments
- recipe
A recipe object.
- x
An epi_df. The typical usage is to pass the same data as that used for fitting the recipe.
- fill_locf
Logical. Should we use LOCF (last observation carried forward) to fill in missing data?
- n_recent
Integer or NULL. If filling missing data with fill_locf = TRUE, how far back are we willing to tolerate missing data? Larger values allow more filling. The default, NULL, determines this from the recipe. For example, suppose n_recent = 3; then if the 3 most recent observations in any geo_value are all NA, nothing can be filled and an error is thrown. (See Details.)
- forecast_date
By default, this is set to the maximum time_value in x. But if there is data latency such that recent NAs should be filled, this may be after the last available time_value.
Value
An object of the same type as x with columns geo_value, time_value, any additional keys, as well as other variables in the original dataset.
Details
The minimum required (recent) data to produce a forecast is equal to the maximum lag requested (on any predictor) plus the longest horizon used if growth rate calculations are requested by the recipe. This is calculated internally.
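As a rough illustration of the window described above, the following base R sketch keeps only the most recent rows needed for a given maximum lag and horizon. The helper recent_window() and the toy data are hypothetical and not part of the package; the internal calculation may differ in its details.

```r
# Hypothetical helper (not part of epipredict): keep only the rows within
# `max_lag + horizon` days of the forecast date, across all locations.
recent_window <- function(x, max_lag, horizon = 0,
                          forecast_date = max(x$time_value)) {
  keep <- max_lag + horizon
  in_window <- x$time_value >= forecast_date - keep &
    x$time_value <= forecast_date
  x[in_window, , drop = FALSE]
}

# Toy data: two locations with 30 daily observations each.
dates <- seq(as.Date("2021-12-02"), as.Date("2021-12-31"), by = "day")
toy <- data.frame(
  geo_value = rep(c("ak", "al"), each = length(dates)),
  time_value = rep(dates, times = 2),
  death_rate = seq_len(2 * length(dates)) / 10
)

# With lags of 0, 7, and 14 days and no growth-rate horizon, the window
# covers 15 days per location (lag 0 through lag 14).
win <- recent_window(toy, max_lag = max(c(0, 7, 14)))
nrow(win)            # 30 rows: 15 days for each of the 2 locations
min(win$time_value)  # "2021-12-17"
```

Under these assumptions, the earliest retained date is 14 days before the forecast date, so every lagged feature can still be computed for the final row.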
The function also optionally fills missing values using the last-observation-carried-forward (LOCF) method. If this is not possible (say, because there would be only NAs in some location), it will produce an error suggesting alternative options to handle missing values with more advanced techniques.
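The LOCF fill and the n_recent tolerance can be sketched in base R for a single location's series. Both helpers below, locf() and fill_recent(), are hypothetical names used for illustration only; they are an approximation of the idea, not the package's implementation.

```r
# Hypothetical helper (not part of epipredict): carry the last observed
# value forward within one location's series, assumed ordered by time.
locf <- function(v) {
  idx <- cumsum(!is.na(v))  # position of the latest non-missing value so far
  idx[idx == 0] <- NA       # leading NAs have nothing to carry forward
  v[!is.na(v)][idx]
}

# Hypothetical check mirroring the n_recent idea: filling is refused when
# the `n_recent` most recent values are all missing.
fill_recent <- function(v, n_recent) {
  if (all(is.na(tail(v, n_recent)))) {
    stop("the ", n_recent, " most recent values are all NA; cannot fill")
  }
  locf(v)
}

# The last three values are not all NA, so the gaps are filled.
filled <- fill_recent(c(1.2, NA, 0.8, NA, NA), n_recent = 3)
filled  # 1.2 1.2 0.8 0.8 0.8
```

In this sketch, a series such as c(1.2, NA, NA, NA) with n_recent = 3 would raise an error, which is the situation where a more advanced imputation method is needed.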
Examples
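library(epipredict)

# create a recipe with one ahead step and lagged predictors
rec <- epi_recipe(case_death_rate_subset) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_lag(case_rate, lag = c(0, 7, 14))

# keep only the recent rows needed to build the lagged features
get_test_data(recipe = rec, x = case_death_rate_subset)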
#> An `epi_df` object, 840 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2022-05-31 19:08:25.791826
#>
#> # A tibble: 840 × 4
#>    geo_value time_value case_rate death_rate
#>  * <chr>     <date>         <dbl>      <dbl>
#>  1 ak        2021-12-17      23.1       1.19
#>  2 al        2021-12-17      15.6      0.290
#>  3 ar        2021-12-17      23.4      0.467
#>  4 as        2021-12-17         0          0
#>  5 az        2021-12-17      41.2       1.04
#>  6 ca        2021-12-17      16.9      0.158
#>  7 co        2021-12-17      30.5      0.578
#>  8 ct        2021-12-17      64.8      0.120
#>  9 dc        2021-12-17      50.4      0.140
#> 10 de        2021-12-17      67.9      0.333
#> # ℹ 830 more rows