step_climate() creates a specification of a recipe step that will
generate one or more new columns of derived data. This step examines all
available seasons in the training data and calculates a measure of center
for the "typical" season. Think of this like the weather: to predict the
temperature in January in Pittsburgh, PA, I might look at all previous
Januaries on record, average their temperatures, and include that in my
model. So it is important to align the forecast horizon with the climate.
This step will work best if added after step_epi_ahead(), but that is not
strictly required. See the details for more information.
Arguments
- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
One or more selector functions to choose variables for this step. See recipes::selections() for more details.
- forecast_ahead
The forecast horizon. By default, this step will try to detect whether a forecast horizon has already been specified with step_epi_ahead(). Alternatively, one can specify an explicit horizon with a scalar integer. Auto-detection is only possible when the time type of the epi_df used to create the epi_recipe is the same as the aggregation time_type specified in this step (say, both daily or both weekly). If, for example, daily data is used with monthly time aggregation, then auto-detection is not possible (and may in fact lead to strange behaviour even if forecast_ahead is specified with an integer). See details below.
- role
What role should be assigned for any variables created by this step? "predictor" is the most likely choice.
- time_type
The duration over which time aggregation should be performed.
- center_method
The measure of center to be calculated over the time window.
- window_size
Scalar integer. How many time units on each side should be included. For example, if window_size = 3 and time_type = "day", then on each day in the data, the center will be calculated using 3 days before and 3 days after. So, in this case, it operates like a weekly rolling average, centered at each day.
- epi_keys
Character vector or NULL. Any columns mentioned will be grouped before performing any center calculation. So for example, given state-level data, a national climate would be calculated if NULL, but passing epi_keys = "geo_value" would calculate the climate separately by state.
- prefix
A character string that will be prefixed to the new column.
- skip
A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
- id
A unique identifier for the step
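As a rough illustration of how window_size and the measure of center interact, the base-R sketch below computes a centered climate value for each day of the year. It is not the implementation used by step_climate(): the toy data, the helper circ_dist(), and the 365-day wrap-around are all assumptions made for this example, and the median is used only because it is the center shown in the printed recipes in the Examples.

```r
# Toy daily series (hypothetical data); 2021-2023 contain no leap day.
dates <- seq(as.Date("2021-01-01"), as.Date("2023-12-31"), by = "day")
doy <- as.integer(format(dates, "%j"))
value <- 10 + 5 * sin(2 * pi * doy / 365)

# Circular distance between days of the year, so that Dec 31 and Jan 1
# are treated as neighbours (hypothetical helper).
circ_dist <- function(a, b, period = 365L) {
  d <- abs(a - b) %% period
  pmin(d, period - d)
}

window_size <- 3L

# For each day of the year, pool every observation within +/- window_size
# days (across all years) and take the median as the measure of center.
climate <- vapply(
  seq_len(365L),
  function(d) median(value[circ_dist(doy, d) <= window_size]),
  numeric(1)
)
head(climate)
```

With window_size = 3, each day's value is based on a 7-day window pooled over every available year, which is the "weekly rolling average, centered at each day" behaviour described for the window_size argument.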
Value
An updated version of recipe with the new step added to the
sequence of any existing operations.
Details
Construction of a climate predictor can be helpful with strongly seasonal data. But its utility is greatest when the estimated "climate" is aligned to the forecast horizon. For example, if today is December 1, and we want to make a prediction for December 15, we want to know the climate for the week of December 15 to use in our model. But we also want to align the rest of our training data with the climate 2 weeks after those dates.
To accomplish this, if we have daily data, we could use time_type = "week"
and forecast_ahead = 2. The climate predictor would be created by taking
averages over each week (with a window of a few weeks before and after, as
determined by window_size), and then aligning these with the appropriate dates
in the training data so that each time_value will "see" the typical climate 2
weeks in the future.
Alternatively, in the same scenario, we could use time_type = "day"
and forecast_ahead = 14. The climate predictor would be created by taking
averages over a small window around each day, and then aligning these with
the appropriate dates in the training data so that each time_value will
"see" the climate 14 days in the future.
The only difference between these options is the type of averaging performed over the historical data. In the first case, days in the same week will get the same value of the climate predictor (because we're looking at weekly windows), while in the second case, every day in the data will have the average climate for the day that happens 14 days in the future.
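The contrast between the two options can be sketched in base R. This is only an illustration under simplifying assumptions, not what step_climate() does internally: the toy data, the helpers day_of_year() and week_of_year(), and the 52-week index are hypothetical, and the extra smoothing controlled by window_size is left out to keep the alignment step visible.

```r
# Toy daily series (hypothetical data); 2021-2023 contain no leap day.
dates <- seq(as.Date("2021-01-01"), as.Date("2023-12-31"), by = "day")
value <- 10 + 5 * sin(2 * pi * as.integer(format(dates, "%j")) / 365)

# Hypothetical helpers: day of year, and a simple 52-week index in which
# day 365 is folded into week 52.
day_of_year <- function(x) as.integer(format(x, "%j"))
week_of_year <- function(x) pmin((day_of_year(x) - 1L) %/% 7L + 1L, 52L)

# Option 1: weekly climate, aligned 2 weeks (14 days) ahead.
weekly_climate <- tapply(value, week_of_year(dates), median)
target_week <- week_of_year(dates + 14)
climate_weekly <- as.numeric(weekly_climate[as.character(target_week)])

# Option 2: daily climate, aligned 14 days ahead.
daily_climate <- tapply(value, day_of_year(dates), median)
target_day <- day_of_year(dates + 14)
climate_daily <- as.numeric(daily_climate[as.character(target_day)])

# Compare the first week of the data under the two options.
data.frame(
  time_value = dates[1:7],
  climate_weekly = climate_weekly[1:7],
  climate_daily = climate_daily[1:7]
)
```

In this sketch the first seven dates fall in the same week, so they share a single weekly climate value, while the daily version assigns each date its own value taken from the day 14 days later.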
Autodetecting the forecast horizon can only be guaranteed to work correctly
when the time types are the same: for example, using daily data for training
and daily climate calculations. However, using weekly data, predicting 4
weeks ahead, and setting time_type = "month" is perfectly reasonable. It's
just that the climate is calculated over months (January, February, March,
etc.), so it is challenging to align it properly when producing a forecast
for, say, the 5th week in the year. For scenarios like these, it may be best to
approximately match the times with forecast_ahead = 1, for example.
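To see why the match is only approximate, the base-R sketch below counts how many calendar months separate each weekly time value from its target 4 weeks later. The dates and the helper month_num() are hypothetical and chosen only for illustration; nothing here reflects how step_climate() handles the alignment internally.

```r
# Weekly time values across one year (hypothetical dates).
time_value <- seq(as.Date("2021-01-02"), by = "week", length.out = 52)
target_date <- time_value + 28 # 4 weeks ahead

# Hypothetical helper: calendar month as an integer.
month_num <- function(x) as.integer(format(x, "%m"))

# Months between each time value and its forecast target,
# wrapping from December into January.
month_offset <- (month_num(target_date) - month_num(time_value)) %% 12L

table(month_offset)
```

For these dates the target usually lands in the following month, but for time values early in a month it can land in the same month, so a monthly forecast_ahead of 1 matches most rows rather than all of them.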
Examples
# automatically detects the horizon
r <- epi_recipe(covid_case_death_rates) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_climate(death_rate, time_type = "day")
r
#>
#> ── Epi Recipe ──────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> raw: 2
#> geo_value: 1
#> time_value: 1
#>
#> ── Operations
#> 1. Leading: death_rate by 7
#> 2. Calculating climate_predictor for: death_rate by day using the median
r %>%
  prep(covid_case_death_rates) %>%
  bake(new_data = NULL)
#> An `epi_df` object, 20,888 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2023-03-10
#>
#> # A tibble: 20,888 × 6
#> geo_value time_value case_rate death_rate ahead_7_death_rate
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ak 2020-12-24 NA NA 0.158
#> 2 al 2020-12-24 NA NA 0.438
#> 3 ar 2020-12-24 NA NA 1.27
#> 4 as 2020-12-24 NA NA 0
#> 5 az 2020-12-24 NA NA 1.10
#> 6 ca 2020-12-24 NA NA 0.755
#> 7 co 2020-12-24 NA NA 0.376
#> 8 ct 2020-12-24 NA NA 0.819
#> 9 dc 2020-12-24 NA NA 0.601
#> 10 de 2020-12-24 NA NA 0.912
#> # ℹ 20,878 more rows
#> # ℹ 1 more variable: climate_death_rate <dbl>
# same idea, but using weekly climate
r <- epi_recipe(covid_case_death_rates) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_climate(death_rate,
    forecast_ahead = 1, time_type = "epiweek",
    window_size = 1L
  )
r
#>
#> ── Epi Recipe ──────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> raw: 2
#> geo_value: 1
#> time_value: 1
#>
#> ── Operations
#> 1. Leading: death_rate by 7
#> 2. Calculating climate_predictor for: death_rate by epiweek using the median
r %>%
  prep(covid_case_death_rates) %>%
  bake(new_data = NULL)
#> An `epi_df` object, 20,888 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2023-03-10
#>
#> # A tibble: 20,888 × 6
#> geo_value time_value case_rate death_rate ahead_7_death_rate
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ak 2020-12-24 NA NA 0.158
#> 2 al 2020-12-24 NA NA 0.438
#> 3 ar 2020-12-24 NA NA 1.27
#> 4 as 2020-12-24 NA NA 0
#> 5 az 2020-12-24 NA NA 1.10
#> 6 ca 2020-12-24 NA NA 0.755
#> 7 co 2020-12-24 NA NA 0.376
#> 8 ct 2020-12-24 NA NA 0.819
#> 9 dc 2020-12-24 NA NA 0.601
#> 10 de 2020-12-24 NA NA 0.912
#> # ℹ 20,878 more rows
#> # ℹ 1 more variable: climate_death_rate <dbl>
# switching the order is possible if you specify `forecast_ahead`
r <- epi_recipe(covid_case_death_rates) %>%
  step_climate(death_rate, forecast_ahead = 7, time_type = "day") %>%
  step_epi_ahead(death_rate, ahead = 7)
r
#>
#> ── Epi Recipe ──────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> raw: 2
#> geo_value: 1
#> time_value: 1
#>
#> ── Operations
#> 1. Calculating climate_predictor for: death_rate by day using the median
#> 2. Leading: death_rate by 7
r %>%
  prep(covid_case_death_rates) %>%
  bake(new_data = NULL)
#> An `epi_df` object, 20,888 x 6 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2023-03-10
#>
#> # A tibble: 20,888 × 6
#> geo_value time_value case_rate death_rate climate_death_rate
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ak 2020-12-24 NA NA NA
#> 2 al 2020-12-24 NA NA NA
#> 3 ar 2020-12-24 NA NA NA
#> 4 as 2020-12-24 NA NA NA
#> 5 az 2020-12-24 NA NA NA
#> 6 ca 2020-12-24 NA NA NA
#> 7 co 2020-12-24 NA NA NA
#> 8 ct 2020-12-24 NA NA NA
#> 9 dc 2020-12-24 NA NA NA
#> 10 de 2020-12-24 NA NA NA
#> # ℹ 20,878 more rows
#> # ℹ 1 more variable: ahead_7_death_rate <dbl>