Convert raw scale predictions to per-capita
Source: R/step_population_scaling.R
step_population_scaling creates a specification of a recipe step that will perform per-capita scaling. Typical usage would load a dataset that contains state-level population, and use it to convert predictions made from a raw scale model to rate-scale by dividing by the population.
Note, however, that there is nothing special about "population": the function can be used to scale by any variable. Population is simply the standard use case in epidemiological forecasting. The values in the chosen column divide the selected variables, while the rate_rescaling argument is a common multiplier applied to the selected variables.
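The arithmetic described above can be sketched in base R. This is only an illustration of "divide by the scaling variable, multiply by rate_rescaling", not the package's implementation; the vectors cases and population and their values are hypothetical.

```r
# Hypothetical inputs; the names and numbers are for illustration only.
cases <- c(ca = 500, ny = 900)
population <- c(ca = 20000, ny = 30000)
rate_rescaling <- 1e5 # express the result per 100,000 people

# Divide by the scaling variable, then apply the common multiplier.
cases_scaled <- cases / population * rate_rescaling
cases_scaled
```

With these made-up numbers, 500 cases in a population of 20,000 correspond to 2,500 per 100,000 people.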
Usage
step_population_scaling(
  recipe,
  ...,
  role = "raw",
  df,
  by = NULL,
  df_pop_col,
  rate_rescaling = 1,
  create_new = TRUE,
  suffix = "_scaled",
  skip = FALSE,
  id = rand_id("population_scaling")
)
Arguments
- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
One or more selector functions to choose variables for this step. See recipes::selections() for more details.
- role
For model terms created by this step, what analysis role should they be assigned? lag is by default a predictor while ahead is an outcome.
- df
A data frame that contains the population data to be used for inverting the existing scaling.
- by
A (possibly named) character vector of variables to join by.
If NULL, the default, the function will try to infer a reasonable set of columns. First, it will try to join by all variables in the training/test data with roles "geo_value", "key", or "time_value" that also appear in df; these roles are automatically set if you are using an epi_df, or you can use, e.g., update_role. If no such roles are set, it will try to perform a natural join, using variables in common between the training/test data and population data.
If columns in the training/testing data and df have the same name (and aren't included in by), a .df suffix is added to the one from the user-provided data to disambiguate.
To join by different variables on the epi_df and df, use a named vector. For example, by = c("geo_value" = "states") will match epi_df$geo_value to df$states. To join by multiple variables, use a vector with length > 1. For example, by = c("geo_value" = "states", "county" = "county") will match epi_df$geo_value to df$states and epi_df$county to df$county.
See dplyr::inner_join() for more details.
- df_pop_col
The name of the column in the data frame df that contains the population data and will be used for scaling. This should be one column.
- rate_rescaling
Sometimes raw scales are "per 100K" or "per 1M". Adjustments can be made here. For example, if the original scale is "per 100K", then set rate_rescaling = 1e5 to get rates.
- create_new
TRUE to create a new column and keep the original column in the epi_df.
- suffix
A character string. The suffix added to the column name if create_new = TRUE. Defaults to "_scaled".
- skip
A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
- id
A unique identifier for the step.
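To illustrate how a named by vector matches differently named key columns, here is a minimal sketch using dplyr::inner_join() directly, which the by argument's description points to. The tibbles epi_data and pop_data are hypothetical, and the sketch only demonstrates dplyr's join semantics; it does not show how the step performs the join internally.

```r
library(dplyr)

# Hypothetical data; column names mirror the `by` example in the argument docs.
epi_data <- tibble(
  geo_value = c("ca", "ny"),
  cases = c(500, 900)
)
pop_data <- tibble(
  states = c("ca", "ny"),
  value = c(20000, 30000)
)

# A named `by` vector matches epi_data$geo_value to pop_data$states.
joined <- inner_join(epi_data, pop_data, by = c("geo_value" = "states"))
joined
```

The joined result keeps the key under the left-hand name (geo_value) and gains the population column value, which plays the role of df_pop_col.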
Examples
library(dplyr)
library(epipredict)

# Keep two states and a single signal to keep the example small.
jhu <- epidatasets::cases_deaths_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ca", "ny")) %>%
  select(geo_value, time_value, cases)

# Population table keyed by `states`, with populations in `value`.
pop_data <- data.frame(states = c("ca", "ny"), value = c(20000, 30000))

# Scale cases by population before creating lags and the outcome.
r <- epi_recipe(jhu) %>%
  step_population_scaling(
    df = pop_data,
    df_pop_col = "value",
    by = c("geo_value" = "states"),
    cases, suffix = "_scaled"
  ) %>%
  step_epi_lag(cases_scaled, lag = c(0, 7, 14)) %>%
  step_epi_ahead(cases_scaled, ahead = 7, role = "outcome") %>%
  step_epi_naomit()

# Post-process predictions and convert them back with the same table.
f <- frosting() %>%
  layer_predict() %>%
  layer_threshold(.pred) %>%
  layer_naomit(.pred) %>%
  layer_population_scaling(.pred,
    df = pop_data,
    by = c("geo_value" = "states"),
    df_pop_col = "value"
  )

wf <- epi_workflow(r, linear_reg()) %>%
  fit(jhu) %>%
  add_frosting(f)

forecast(wf)
#> An `epi_df` object, 2 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * other_keys = geo_value, time_value
#> * as_of = 2024-03-20
#>
#> # A tibble: 2 × 4
#> geo_value time_value .pred .pred_scaled
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2021-12-31 4.25 84938.
#> 2 ny 2021-12-31 5.93 177766.
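As a quick sanity check on the printed forecast, the sketch below divides .pred_scaled by the populations in pop_data and compares the result with the .pred column. It assumes that layer_population_scaling() multiplied the prediction by the population with its rate_rescaling left at 1; the numbers are copied from the output above, where .pred is rounded for display.

```r
# Values copied from the printed forecast above (.pred is rounded for display).
pred <- c(ca = 4.25, ny = 5.93)
pred_scaled <- c(ca = 84938, ny = 177766)
population <- c(ca = 20000, ny = 30000) # from pop_data

# Undo the population scaling by hand and compare with the rounded .pred column.
implied_pred <- pred_scaled / population
round(implied_pred, 2)
```

The implied values agree with the displayed .pred column to within rounding.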