Skip to contents

Panel data, or longitudinal data, contain cross-sectional measurements of subjects over time. The epipredict package is most suitable for running forecasters on epidemiological panel data. A built-in example of this is the case_death_rate_subset dataset, which contains daily state-wise measures of case_rate and death_rate for COVID-19 in 2021:

head(case_death_rate_subset, 3)
#> An `epi_df` object, 3 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 19:08:25.791826
#> 
#> # A tibble: 3 × 4
#>   geo_value time_value case_rate death_rate
#> * <chr>     <date>         <dbl>      <dbl>
#> 1 ak        2020-12-31      35.9      0.158
#> 2 al        2020-12-31      65.1      0.438
#> 3 ar        2020-12-31      66.0      1.27

epipredict functions work with data in epi_df format. Despite the stated goal and name of the package, other panel datasets are also valid candidates for epipredict functionality, as long as they are in epi_df format.

Example panel data overview

In this vignette, we will demonstrate using epipredict with employment panel data from Statistics Canada. We will be using Table 37-10-0115-01: Characteristics and median employment income of longitudinal cohorts of postsecondary graduates two and five years after graduation, by educational qualification and field of study (primary groupings).

The full dataset contains yearly median employment income two and five years after graduation, and number of graduates. The data is stratified by variables such as geographic region (Canadian province), education, and age group. The year range of the dataset is 2010 to 2017, inclusive. The full dataset also contains metadata that describes the quality of data collected. For demonstration purposes, we make the following modifications to get a subset of the full dataset:

  • Only keep provincial-level geographic region (the full data also has “Canada” as a region)
  • Only keep “good” or better quality data rows, as indicated by the STATUS column
  • Choose a subset of covariates and aggregate across the remaining ones. The chosen covariates are age group, and educational qualification.

To use this data with epipredict, we need to convert it into epi_df format using epiprocess::as_epi_df() with additional keys. In our case, the additional keys are age_group, and edu_qual. Note that in the above modifications, we encoded time_value as type integer. This lets us set time_type = "year", and ensures that lag and ahead modifications later on are using the correct time units. See the epiprocess::epi_df for a list of all the time_types available.

Now, we are ready to use grad_employ_subset with epipredict. Our epi_df contains 1,445 rows and 7 columns. Here is a quick summary of the columns in our epi_df:

  • time_value (time value): year in date format
  • geo_value (geo value): province in Canada
  • num_graduates (raw, time series value): number of graduates
  • med_income_2y (raw, time series value): median employment income 2 years after graduation
  • med_income_5y (raw, time series value): median employment income 5 years after graduation
  • age_group (key): one of two age groups, either 15 to 34 years, or 35 to 64 years
  • edu_qual (key): one of 32 unique educational qualifications, e.g., “Master’s diploma”
# Rename for simplicity
employ <- grad_employ_subset
sample_n(employ, 6)
#> An `epi_df` object, 6 x 7 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 6 × 7
#>   geo_value        age_group     edu_qual time_value num_graduates med_income_2y
#> * <chr>            <fct>         <fct>         <int>         <dbl>         <dbl>
#> 1 Saskatchewan     35 to 64 yea… Undergr…       2016           120         69800
#> 2 British Columbia 35 to 64 yea… Post-ba…       2017           240         65000
#> 3 Saskatchewan     35 to 64 yea… Post-ba…       2012            10         96800
#> 4 Quebec           15 to 34 yea… Master'…       2010            80         65500
#> 5 British Columbia 35 to 64 yea… Career,…       2012          3060         43000
#> 6 New Brunswick    35 to 64 yea… Career,…       2013           230         26700
#> # ℹ 1 more variable: med_income_5y <dbl>

In the following sections, we will go over pre-processing the data in the epi_recipe framework, and fitting a model and making predictions within the epipredict framework and using the package’s canned forecasters.

Autoregressive (AR) model to predict number of graduates in a year

Pre-processing

As a simple example, let’s work with the num_graduates column for now. We will first pre-process by standardizing each numeric column by the total within each group of keys. We do this since those raw numeric values will vary greatly from province to province since there are large differences in population.

employ_small <- employ %>%
  group_by(geo_value, age_group, edu_qual) %>%
  # Select groups where there are complete time series values
  filter(n() >= 6) %>%
  mutate(
    num_graduates_prop = num_graduates / sum(num_graduates),
    med_income_2y_prop = med_income_2y / sum(med_income_2y),
    med_income_5y_prop = med_income_5y / sum(med_income_5y)
  ) %>%
  ungroup()
head(employ_small)
#> An `epi_df` object, 6 x 10 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 6 × 10
#>   geo_value            age_group edu_qual time_value num_graduates med_income_2y
#> * <chr>                <fct>     <fct>         <int>         <dbl>         <dbl>
#> 1 Newfoundland and La… 15 to 34… Career,…       2010           430         48800
#> 2 Newfoundland and La… 35 to 64… Career,…       2010           140         38100
#> 3 Newfoundland and La… 15 to 34… Career,…       2010           630         49500
#> 4 Newfoundland and La… 35 to 64… Career,…       2010           140         48400
#> 5 Newfoundland and La… 15 to 34… Undergr…       2010          1050         63600
#> 6 Newfoundland and La… 35 to 64… Undergr…       2010           130         85700
#> # ℹ 4 more variables: med_income_5y <dbl>, num_graduates_prop <dbl>,
#> #   med_income_2y_prop <dbl>, med_income_5y_prop <dbl>

Below is a visualization for a sample of the small data for British Columbia and Ontario. Note that some groups do not have any time series information since we filtered out all time series with incomplete dates.

employ_small %>%
  filter(geo_value %in% c("British Columbia", "Ontario")) %>%
  filter(grepl("degree", edu_qual, fixed = T)) %>%
  group_by(geo_value, time_value, edu_qual, age_group) %>%
  summarise(num_graduates_prop = sum(num_graduates_prop), .groups = "drop") %>%
  ggplot(aes(x = time_value, y = num_graduates_prop, color = geo_value)) +
  geom_line() +
  scale_colour_manual(values = c("Cornflowerblue", "Orange"), name = "") +
  facet_grid(rows = vars(edu_qual), cols = vars(age_group)) +
  xlab("Year") +
  ylab("Percentage of gratuates") +
  theme(legend.position = "bottom")

We will predict the standardized number of graduates (a proportion) in the next year (time t+1t+1) using an autoregressive model with three lags (i.e., an AR(3) model). Such a model is represented algebraically like this:

yt+1,ijk=α0+α1ytijk+α2yt1,ijk+α3yt2,ijk+ϵtijk y_{t+1,ijk} = \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk} + \epsilon_{tijk}

where ytijy_{tij} is the proportion of graduates at time tt in location ii and age group jj with education quality kk.

In the pre-processing step, we need to create additional columns in employ for each of yt+1,ijky_{t+1,ijk}, ytijky_{tijk}, yt1,ijky_{t-1,ijk}, and yt2,ijky_{t-2,ijk}. We do this via an epi_recipe. Note that creating an epi_recipe alone doesn’t add these outcome and predictor columns; the recipe just stores the instructions for adding them.

Our epi_recipe should add one ahead column representing yt+1,ijky_{t+1,ijk} and 3 lag columns representing ytijky_{tijk}, yt1,ijky_{t-1,ijk}, and yt2,ijky_{t-2,ijk} (it’s more accurate to think of the 0th “lag” as the “current” value with 2 lags, but that’s not quite how the processing works). Also note that since we specified our time_type to be year, our lag and lead values are both in years.

r <- epi_recipe(employ_small) %>%
  step_epi_ahead(num_graduates_prop, ahead = 1) %>%
  step_epi_lag(num_graduates_prop, lag = 0:2) %>%
  step_epi_naomit()
r
#> 
#> ── Epi Recipe ──────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> raw:        6
#> key:        2
#> geo_value:  1
#> time_value: 1
#> 
#> ── Operations
#> 1. Leading: num_graduates_prop by 1
#> 2. Lagging: num_graduates_prop by 0, 1, 2
#> 3.  Removing rows with NA values in: all_predictors()
#> 4.  Removing rows with NA values in: all_outcomes()

Let’s apply this recipe using prep and bake to generate and view the lag and ahead columns.

# Display a sample of the pre-processed data
bake_and_show_sample <- function(recipe, data, n = 5) {
  recipe %>%
    prep(data) %>%
    bake(new_data = data) %>%
    sample_n(n)
}

r %>% bake_and_show_sample(employ_small)
#> An `epi_df` object, 5 x 14 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 5 × 14
#>   geo_value        age_group     edu_qual time_value num_graduates med_income_2y
#> * <chr>            <fct>         <fct>         <int>         <dbl>         <dbl>
#> 1 Ontario          35 to 64 yea… Master'…       2012            40        109300
#> 2 British Columbia 15 to 34 yea… Undergr…       2014         11520         51100
#> 3 Alberta          35 to 64 yea… Career,…       2013          1270         47400
#> 4 Alberta          35 to 64 yea… Undergr…       2016          1050         78400
#> 5 Quebec           15 to 34 yea… Career,…       2016          3920         39900
#> # ℹ 8 more variables: med_income_5y <dbl>, num_graduates_prop <dbl>,
#> #   med_income_2y_prop <dbl>, med_income_5y_prop <dbl>,
#> #   ahead_1_num_graduates_prop <dbl>, lag_0_num_graduates_prop <dbl>,
#> #   lag_1_num_graduates_prop <dbl>, lag_2_num_graduates_prop <dbl>

We can see that the prep and bake steps created new columns according to our epi_recipe:

  • ahead_1_num_graduates_prop corresponds to yt+1,ijky_{t+1,ijk}
  • lag_0_num_graduates_prop, lag_1_num_graduates_prop, and lag_2_num_graduates_prop correspond to ytijky_{tijk}, yt1,ijky_{t-1,ijk}, and yt2,ijky_{t-2,ijk} respectively.

Model fitting and prediction

Since our goal for now is to fit a simple autoregressive model, we can use parsnip::linear_reg() with the default engine lm, which fits a linear regression using ordinary least squares.

We will use epi_workflow with the epi_recipe we defined in the pre-processing section along with the parsnip::linear_reg() model. Note that epi_workflow is a container and doesn’t actually do the fitting. We have to pass the workflow into fit() to get our estimated model coefficients α̂i,i=0,...,3\widehat{\alpha}_i,\ i=0,...,3.

wf_linreg <- epi_workflow(r, linear_reg()) %>%
  fit(employ_small)
summary(extract_fit_engine(wf_linreg))
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.104501 -0.013043 -0.002708  0.009289  0.210582 
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               0.108532   0.006695  16.211  < 2e-16 ***
#> lag_0_num_graduates_prop  0.324251   0.037163   8.725  < 2e-16 ***
#> lag_1_num_graduates_prop  0.014190   0.038543   0.368 0.712848    
#> lag_2_num_graduates_prop -0.137378   0.036337  -3.781 0.000168 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.02993 on 777 degrees of freedom
#> Multiple R-squared:  0.1084, Adjusted R-squared:  0.1049 
#> F-statistic: 31.47 on 3 and 777 DF,  p-value: < 2.2e-16

This output tells us the coefficients of the fitted model; for instance, the estimated intercept is α̂0=\widehat{\alpha}_0 = 0.109 and the coefficient for ytijky_{tijk} is α̂1=\widehat\alpha_1 = 0.324. The summary also tells us that all estimated coefficients are significantly different from zero. Extracting the 95% confidence intervals for the coefficients also leads us to the same conclusion: all the coefficient estimates are significantly different from 0.

confint(extract_fit_engine(wf_linreg))
#>                                2.5 %      97.5 %
#> (Intercept)               0.09538942  0.12167466
#> lag_0_num_graduates_prop  0.25130008  0.39720211
#> lag_1_num_graduates_prop -0.06147071  0.08985152
#> lag_2_num_graduates_prop -0.20870743 -0.06604791

Now that we have our workflow, we can generate predictions from a subset of our data. For this demo, we will predict the number of graduates using the last 2 years of our dataset.

latest <- get_test_data(recipe = r, x = employ_small)
preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
# Display a sample of the prediction values, excluding NAs
preds %>% sample_n(5)
#> An `epi_df` object, 5 x 5 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 5 × 5
#>   geo_value        age_group      edu_qual                      time_value .pred
#> * <chr>            <fct>          <fct>                              <int> <dbl>
#> 1 British Columbia 35 to 64 years Professional degree                 2017 0.136
#> 2 Quebec           35 to 64 years Career, technical or profess…       2017 0.141
#> 3 British Columbia 35 to 64 years Career, technical or profess…       2017 0.135
#> 4 New Brunswick    35 to 64 years Undergraduate degree                2017 0.128
#> 5 Alberta          35 to 64 years Career, technical or profess…       2017 0.138

We can do this using the augment function too. Note that predict and augment both still return an epiprocess::epi_df with all of the keys that were present in the original dataset.

augment(wf_linreg, latest) %>% sample_n(5)
#> An `epi_df` object, 5 x 11 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 5 × 11
#>   geo_value     age_group edu_qual time_value  .pred num_graduates med_income_2y
#> * <chr>         <fct>     <fct>         <int>  <dbl>         <dbl>         <dbl>
#> 1 Saskatchewan  15 to 34… Undergr…       2017  0.136          2690         64900
#> 2 Ontario       15 to 34… Post-ba…       2015 NA                30         48400
#> 3 Nova Scotia   15 to 34… Post ca…       2016 NA                80         50000
#> 4 Prince Edwar… 15 to 34… Undergr…       2017  0.134           340         50000
#> 5 Nova Scotia   15 to 34… Undergr…       2016 NA                90         53000
#> # ℹ 4 more variables: med_income_5y <dbl>, num_graduates_prop <dbl>,
#> #   med_income_2y_prop <dbl>, med_income_5y_prop <dbl>

Model diagnostics

First, we’ll plot the residuals (that is, ytijkŷtijky_{tijk} - \widehat{y}_{tijk}) against the fitted values (ŷtijk\widehat{y}_{tijk}).

par(mfrow = c(2, 2), mar = c(5, 3, 1.2, 0))
plot(extract_fit_engine(wf_linreg))

The fitted values vs. residuals plot shows us that the residuals are mostly clustered around zero, but do not form an even band around the zero line, indicating that the variance of the residuals is not constant. Additionally, the fitted values vs. square root of standardized residuals makes this more obvious - the spread of the square root of standardized residuals varies with the fitted values.

The Q-Q plot shows us that the residuals have heavier tails than a Normal distribution. So the normality of residuals assumption doesn’t hold either.

Finally, the residuals vs. leverage plot shows us that we have a few influential points based on the Cook’s distance (those outside the red dotted line).

Since we appear to be violating the linear model assumptions, we might consider transforming our data differently, or considering a non-linear model, or something else.

AR model with exogenous inputs

Now suppose we want to model the 1-step-ahead 5-year employment income using current and two previous values, while also incorporating information from the other two time-series in our dataset: the 2-year employment income and the number of graduates in the previous 2 years. We would do this using an autoregressive model with exogenous inputs, defined as follows:

yt+1,ijk=α0+α1ytijk+α2yt1,ijk+α3yt2,ijk+β1xtijk+β2xt1,ijk+γ2ztijk+γ2zt1,ijk+ϵtijk \begin{aligned} y_{t+1,ijk} &= \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk}\\ &\quad + \beta_1 x_{tijk} + \beta_2 x_{t-1,ijk}\\ &\quad + \gamma_2 z_{tijk} + \gamma_2 z_{t-1,ijk} + \epsilon_{tijk} \end{aligned}

where ytijky_{tijk} is the 5-year median income (proportion) at time tt (in location ii, age group jj with education quality kk), xtijkx_{tijk} is the 2-year median income (proportion) at time tt, and ztijkz_{tijk} is the number of graduates (proportion) at time tt.

Pre-processing

Again, we construct an epi_recipe detailing the pre-processing steps.

rx <- epi_recipe(employ_small) %>%
  step_epi_ahead(med_income_5y_prop, ahead = 1) %>%
  # 5-year median income has current, and two lags c(0, 1, 2)
  step_epi_lag(med_income_5y_prop, lag = 0:2) %>%
  # But the two exogenous variables have current values, and 1 lag c(0, 1)
  step_epi_lag(med_income_2y_prop, lag = c(0, 1)) %>%
  step_epi_lag(num_graduates_prop, lag = c(0, 1)) %>%
  step_epi_naomit()

bake_and_show_sample(rx, employ_small)
#> An `epi_df` object, 5 x 18 with metadata:
#> * geo_type  = custom
#> * time_type = integer
#> * other_keys = age_group, edu_qual
#> * as_of     = 2022-07-19
#> 
#> # A tibble: 5 × 18
#>   geo_value            age_group edu_qual time_value num_graduates med_income_2y
#> * <chr>                <fct>     <fct>         <int>         <dbl>         <dbl>
#> 1 Prince Edward Island 35 to 64… Undergr…       2017            10         69700
#> 2 British Columbia     35 to 64… Post-ba…       2014           180         74700
#> 3 Alberta              35 to 64… Career,…       2016          1250         46000
#> 4 Saskatchewan         15 to 34… Undergr…       2015          2600         65400
#> 5 Saskatchewan         15 to 34… Doctora…       2016            70         56800
#> # ℹ 12 more variables: med_income_5y <dbl>, num_graduates_prop <dbl>,
#> #   med_income_2y_prop <dbl>, med_income_5y_prop <dbl>,
#> #   ahead_1_med_income_5y_prop <dbl>, lag_0_med_income_5y_prop <dbl>,
#> #   lag_1_med_income_5y_prop <dbl>, lag_2_med_income_5y_prop <dbl>,
#> #   lag_0_med_income_2y_prop <dbl>, lag_1_med_income_2y_prop <dbl>,
#> #   lag_0_num_graduates_prop <dbl>, lag_1_num_graduates_prop <dbl>

Model fitting & post-processing

Before fitting our model and making predictions, let’s add some post-processing steps using a few frosting layers to do a few things:

  1. Threshold our predictions to 0. We are predicting proportions, which can’t be negative. And the transformed values back to dollars and people can’t be negative either.
  2. Generate prediction intervals based on residual quantiles, allowing us to quantify the uncertainty associated with future predicted values.
  3. Convert our predictions back to income values and number of graduates, rather than standardized proportions. We do this via the frosting layer layer_population_scaling().
# Create dataframe of the sums we used for standardizing
# Only have to include med_income_5y since that is our outcome
totals <- employ_small %>%
  group_by(geo_value, age_group, edu_qual) %>%
  summarise(med_income_5y_tot = sum(med_income_5y), .groups = "drop")

# Define post-processing steps
f <- frosting() %>%
  layer_predict() %>%
  layer_naomit(.pred) %>%
  layer_threshold(.pred, lower = 0) %>%
  # 90% prediction interval
  layer_residual_quantiles(
    quantile_levels = c(0.1, 0.9),
    symmetrize = FALSE
  ) %>%
  layer_population_scaling(
    .pred, .pred_distn,
    df = totals, df_pop_col = "med_income_5y_tot"
  )

wfx_linreg <- epi_workflow(rx, parsnip::linear_reg()) %>%
  fit(employ_small) %>%
  add_frosting(f)

summary(extract_fit_engine(wfx_linreg))
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.049668 -0.004509 -0.000516  0.004707  0.049882 
#> 
#> Coefficients:
#>                           Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)               0.041278   0.004975   8.298 4.72e-16 ***
#> lag_0_med_income_5y_prop  0.320780   0.049348   6.500 1.44e-10 ***
#> lag_1_med_income_5y_prop  0.079610   0.049116   1.621  0.10546    
#> lag_2_med_income_5y_prop  0.073048   0.033686   2.168  0.03043 *  
#> lag_0_med_income_2y_prop  0.118122   0.045579   2.592  0.00973 ** 
#> lag_1_med_income_2y_prop  0.034455   0.042749   0.806  0.42050    
#> lag_0_num_graduates_prop -0.025129   0.013603  -1.847  0.06509 .  
#> lag_1_num_graduates_prop  0.078268   0.013396   5.842 7.58e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.01056 on 773 degrees of freedom
#> Multiple R-squared:  0.3224, Adjusted R-squared:  0.3163 
#> F-statistic: 52.54 on 7 and 773 DF,  p-value: < 2.2e-16

Based on the summary output for this model, we can examine confidence intervals and perform hypothesis tests as usual.

Let’s take a look at the predictions along with their 90% prediction intervals.

latest <- get_test_data(recipe = rx, x = employ_small)
predsx <- predict(wfx_linreg, latest)

# Display predictions along with prediction intervals
predsx %>%
  select(
    geo_value, time_value, edu_qual, age_group,
    .pred_scaled, .pred_distn_scaled
  ) %>%
  head() %>%
  pivot_quantiles_wider(.pred_distn_scaled)
#> # A tibble: 6 × 7
#>   geo_value time_value edu_qual             age_group .pred_scaled  `0.1`  `0.9`
#>   <chr>          <int> <fct>                <fct>            <dbl>  <dbl>  <dbl>
#> 1 Alberta         2017 Career, technical o… 15 to 34…       46467. 42479. 5.12e4
#> 2 Alberta         2017 Career, technical o… 15 to 34…       57276. 52474. 6.29e4
#> 3 Alberta         2017 Post career, techni… 15 to 34…       75507. 69424. 8.27e4
#> 4 Alberta         2017 Undergraduate degree 15 to 34…       72974. 66815. 8.02e4
#> 5 Alberta         2017 Master's degree      15 to 34…       88015. 80610. 9.68e4
#> 6 Alberta         2017 Doctoral degree      15 to 34…       91589. 83825. 1.01e5

Using canned forecasters

We’ve seen what we can do with non-epidemiological panel data using the recipes frame, with epi_recipe for pre-processing, epi_workflow for model fitting, and frosting for post-processing.

epipredict also comes with canned forecasters that do all of those steps behind the scenes for some simple models. Even though we aren’t working with epidemiological data, canned forecasters still work as expected, out of the box. We will demonstrate this with the simple flatline_forecaster and the direct autoregressive (AR) forecaster arx_forecaster.

For both illustrations, we will continue to use the employ_small dataset with the transformed numeric columns that are proportions within each group by the keys in our epi_df.

Flatline forecaster

In this first example, we’ll use flatline_forecaster to make a simple prediction of the 2-year median income for the next year, based on one previous time point. This model is representated algebraically as: yt+1,ijk=ytijk+ϵtijky_{t+1,ijk} = y_{tijk} + \epsilon_{tijk} where ytijky_{tijk} is the 2-year median income (proportion) at time tt.

out_fl <- flatline_forecaster(employ_small, "med_income_2y_prop",
  args_list = flatline_args_list(ahead = 1)
)

out_fl
#> ══ A basic forecaster of type flatline ═════════════════════════════════════════
#> 
#> This forecaster was fit on 2024-11-13 16:03:23.
#> 
#> Training data was an <epi_df> with:
#> • Geography: custom,
#> • Other keys: age_group and edu_qual,
#> • Time type: integer,
#> • Using data up-to-date as of: 2022-07-19.
#> 
#> ── Predictions ─────────────────────────────────────────────────────────────────
#> 
#> A total of 169 predictions are available for
#> • 11 unique geographic regions,
#> • At forecast dates: 2017,
#> • For target dates: 2018.
#> 

Autoregressive forecaster with exogenous inputs

In this second example, we’ll use arx_forecaster to make a prediction of the 5-year median income based using two lags, and using two lags on two exogenous variables: 2-year median income and number of graduates.

The canned forecaster gives us a simple way of making this forecast since it defines the recipe, workflow, and post-processing steps behind the scenes. This is very similar to the model we introduced in the “Autoregressive Linear Model with Exogenous Inputs” section of this article, but where all inputs have the same number of lags.

arx_args <- arx_args_list(lags = c(0L, 1L), ahead = 1L)

out_arx_lr <- arx_forecaster(employ_small, "med_income_5y_prop",
  c("med_income_5y_prop", "med_income_2y_prop", "num_graduates_prop"),
  args_list = arx_args
)

out_arx_lr
#> ══ A basic forecaster of type ARX Forecaster ═══════════════════════════════════
#> 
#> This forecaster was fit on 2024-11-13 16:03:24.
#> 
#> Training data was an <epi_df> with:
#> • Geography: custom,
#> • Other keys: age_group and edu_qual,
#> • Time type: integer,
#> • Using data up-to-date as of: 2022-07-19.
#> 
#> ── Predictions ─────────────────────────────────────────────────────────────────
#> 
#> A total of 168 predictions are available for
#> • 11 unique geographic regions,
#> • At forecast dates: 2017,
#> • For target dates: 2018.
#> 

Other changes to the direct AR forecaster, like changing the engine, also work as expected. Below we use a boosted tree model instead of a linear regression.

out_arx_rf <- arx_forecaster(
  employ_small, "med_income_5y_prop",
  c("med_income_5y_prop", "med_income_2y_prop", "num_graduates_prop"),
  trainer = parsnip::boost_tree(mode = "regression", trees = 20),
  args_list = arx_args
)

out_arx_rf
#> ══ A basic forecaster of type ARX Forecaster ═══════════════════════════════════
#> 
#> This forecaster was fit on 2024-11-13 16:03:24.
#> 
#> Training data was an <epi_df> with:
#> • Geography: custom,
#> • Other keys: age_group and edu_qual,
#> • Time type: integer,
#> • Using data up-to-date as of: 2022-07-19.
#> 
#> ── Predictions ─────────────────────────────────────────────────────────────────
#> 
#> A total of 168 predictions are available for
#> • 11 unique geographic regions,
#> • At forecast dates: 2017,
#> • For target dates: 2018.
#> 

Conclusion

While the purpose of epipredict is to allow tidymodels to operate on epidemiology data, it can be easily adapted (both the workflows and the canned forecasters) to work for generic panel data modelling.