Convert per-capita predictions to raw scale
Source: R/layer_population_scaling.R
layer_population_scaling() creates a specification of a frosting layer
that will "undo" per-capita scaling. In typical usage, you load a dataset
that contains state-level population and use it to convert predictions made
by a rate-scale model to the raw scale by multiplying by the population.
Note, however, that there is nothing special about "population": the
function can scale by any variable. Population is simply the standard use
case in epidemiological forecasting. The selected variables are multiplied
by the values in the supplied column and divided by the rate_rescaling
argument, which acts as a common divisor for all of them.
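The arithmetic described above can be sketched in a few lines of base R. This is only an illustration of the scaling rule, not the package's internal code; the data frame, the population figures, and the variable names are made up for the example.

```r
# Hypothetical per-100K predictions; all values are illustrative only.
preds <- data.frame(
  geo_value = c("ca", "ny"),
  .pred = c(12.5, 8)              # predicted rate per 100,000 people
)
population <- c(ca = 39500000, ny = 19500000)
rate_rescaling <- 1e5             # the rate is "per 100K"

# count = rate * population / rate_rescaling
preds$.pred_scaled <- unname(
  preds$.pred * population[preds$geo_value] / rate_rescaling
)
preds
```

With rate_rescaling left at its default of 1, the same expression reduces to a plain multiplication by the population.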
Usage
layer_population_scaling(
  frosting,
  ...,
  df,
  by = NULL,
  df_pop_col,
  rate_rescaling = 1,
  create_new = TRUE,
  suffix = "_scaled",
  id = rand_id("population_scaling")
)
Arguments
- frosting
a frosting postprocessor. The layer will be added to the sequence of operations for this frosting.
- ...
One or more selector functions to scale variables for this step. See recipes::selections() for more details.
- df
a data frame that contains the population data to be used for inverting the existing scaling.
- by
A (possibly named) character vector of variables to join by.
If NULL, the default, the function will perform a natural join, using all variables in common across the epi_df produced by the predict() call and the user-provided dataset. If columns in that epi_df and df have the same name (and aren't included in by), .df is added to the one from the user-provided data to disambiguate.
To join by different variables on the epi_df and df, use a named vector. For example, by = c("geo_value" = "states") will match epi_df$geo_value to df$states. To join by multiple variables, use a vector with length > 1. For example, by = c("geo_value" = "states", "county" = "county") will match epi_df$geo_value to df$states and epi_df$county to df$county.
See dplyr::left_join() for more details.
- df_pop_col
the name of the column in the data frame df that contains the population data and is used for scaling.
- rate_rescaling
Sometimes rates are "per 100K" or "per 1M" rather than "per person". Adjustments can be made here. For example, if the original rate is "per 100K", then set rate_rescaling = 1e5 to get counts back.
- create_new
TRUE to create a new column and keep the original column in the epi_df.
- suffix
a character. The suffix added to the column name if create_new = TRUE. Defaults to "_scaled".
- id
a random id string
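To make the interaction of by, df_pop_col, create_new and suffix more concrete, here is a rough dplyr sketch of the kind of join-and-scale operation the arguments describe. It is an approximation written for illustration, not the layer's actual implementation; the preds and pop_data tables are hypothetical stand-ins for the predictions and the user-provided df.

```r
library(dplyr)

# Hypothetical stand-ins for the predictions and the user-provided `df`.
preds <- tibble(
  geo_value = c("ca", "ny"),
  .pred = c(4.25, 5.93)
)
pop_data <- tibble(states = c("ca", "ny"), value = c(20000, 30000))

rate_rescaling <- 1

scaled <- preds %>%
  # by = c("geo_value" = "states") matches preds$geo_value to pop_data$states
  left_join(pop_data, by = c("geo_value" = "states")) %>%
  # mimics create_new = TRUE with suffix = "_scaled": keep .pred, add .pred_scaled
  mutate(.pred_scaled = .pred * value / rate_rescaling) %>%
  # drop the population column (df_pop_col = "value") once it has been used
  select(-value)

scaled
```

Because the join is a left join, a geo_value with no matching row in the population table would end up with a missing scaled value rather than being dropped.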
Examples
library(dplyr)
library(epipredict)

# Keep a small subset of the data: two states, cases only
jhu <- cases_deaths_subset %>%
  filter(time_value > "2021-11-01", geo_value %in% c("ca", "ny")) %>%
  select(geo_value, time_value, cases)

# Toy population table; the values are illustrative, not real populations
pop_data <- data.frame(states = c("ca", "ny"), value = c(20000, 30000))

# Preprocessing: convert cases to a per-capita rate, then build lags and the target
r <- epi_recipe(jhu) %>%
  step_population_scaling(
    df = pop_data,
    df_pop_col = "value",
    by = c("geo_value" = "states"),
    cases, suffix = "_scaled"
  ) %>%
  step_epi_lag(cases_scaled, lag = c(0, 7, 14)) %>%
  step_epi_ahead(cases_scaled, ahead = 7, role = "outcome") %>%
  step_epi_naomit()

# Postprocessing: predict on the rate scale, then multiply back by population
f <- frosting() %>%
  layer_predict() %>%
  layer_threshold(.pred) %>%
  layer_naomit(.pred) %>%
  layer_population_scaling(.pred,
    df = pop_data,
    by = c("geo_value" = "states"),
    df_pop_col = "value"
  )

wf <- epi_workflow(r, linear_reg()) %>%
  fit(jhu) %>%
  add_frosting(f)

forecast(wf)
#> An `epi_df` object, 2 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-03-20
#>
#> # A tibble: 2 × 4
#>   geo_value time_value .pred .pred_scaled
#> * <chr>     <date>     <dbl>        <dbl>
#> 1 ca        2021-12-31  4.25       84938.
#> 2 ny        2021-12-31  5.93      177766.
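As a quick sanity check on the printed forecast, the relationship between .pred and .pred_scaled can be verified by hand. The sketch below simply copies the displayed numbers into base R vectors (so it inherits the rounding of the printed tibble) and divides the scaled predictions by the toy populations from pop_data.

```r
# Values copied from the printed forecast above (already rounded for display).
pred_scaled <- c(ca = 84938, ny = 177766)
population  <- c(ca = 20000, ny = 30000)

# Dividing out the population recovers the rate-scale predictions,
# up to the rounding applied when the tibble is printed.
recovered <- round(pred_scaled / population, 2)
recovered  # matches the printed .pred column: 4.25 for ca, 5.93 for ny
```

Multiplying the displayed .pred values by the populations gives 85000 and 177900 rather than the exact .pred_scaled values, because .pred is shown to only three significant digits.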