This package provides tools for building COVID-19 models, with a focus on forecasting and hotspot prediction models. It is designed to be used in conjunction with the covidcast and evalcast packages.

Installing

This package is not on CRAN yet, so it can be installed using the devtools package:

devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/modeltools")

Building the vignettes, such as this getting started guide, takes a significant amount of time. They are not included in the package by default. If you want to include vignettes, then use this modified command:

devtools::install_github("cmu-delphi/covidcast", ref = "main",
                         subdir = "R-packages/modeltools",
                         build_vignettes = TRUE, dependencies = TRUE)

For this getting started vignette, we’ll fetch COVID-19 case rates, both raw and smoothed via 7-day trailing averages, from the USAFacts data source. We’ll look at data at the state level for four states, between the start of June and mid November.

library(covidcast)
## We encourage COVIDcast API users to register on our mailing list:
## https://lists.andrew.cmu.edu/mailman/listinfo/delphi-covidcast-api
## We'll send announcements about new data sources, package updates,
## server maintenance, and new features.
start_day <- "2020-06-01"
end_day <- "2020-11-15"
geo_values <- c("ca", "fl", "ny", "tx")

signals <- suppressMessages(
  covidcast_signals(data_source = "usa-facts",
                    signal = c("confirmed_incidence_prop",
                               "confirmed_7dav_incidence_prop"),
                    start_day = start_day, end_day = end_day,
                    geo_type = "state", geo_values = geo_values))

summary(signals[[1]])
## A `covidcast_signal` data frame with 672 rows and 9 columns.
## 
## data_source : usa-facts
## signal      : confirmed_incidence_prop
## geo_type    : state
## 
## first date                          : 2020-06-01
## last date                           : 2020-11-15
## median number of geo_values per day : 4
summary(signals[[2]])
## A `covidcast_signal` data frame with 672 rows and 9 columns.
## 
## data_source : usa-facts
## signal      : confirmed_7dav_incidence_prop
## geo_type    : state
## 
## first date                          : 2020-06-01
## last date                           : 2020-11-15
## median number of geo_values per day : 4

Slide with a formula

One of the most basic tools in the modeltools package is slide_by_geo(), which is based on the family of functions provided by the slider package. In modeltools, to “slide” means to apply the function or formula over a trailing window of n days of data, grouped by geo_value. Many other functions in the package, such as pct_change() and estimate_deriv(), use slide_by_geo() as their workhorse.

For example, to apply a 7-day trailing average to the raw values in the first covidcast_signal data frame, we can specify a formula via the slide_fun argument of slide_by_geo():

library(modeltools)
library(dplyr)

slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7) %>%
  select(geo_value, time_value, value, slide_value) %>%
  arrange(geo_value) %>%
  head(10)
## # A tibble: 10 x 4
##    geo_value time_value value slide_value
##    <chr>     <date>     <dbl>       <dbl>
##  1 ca        2020-06-01  6.34        6.34
##  2 ca        2020-06-02  5.86        6.10
##  3 ca        2020-06-03  5.88        6.03
##  4 ca        2020-06-04  7.97        6.51
##  5 ca        2020-06-05  8.40        6.89
##  6 ca        2020-06-06  6.71        6.86
##  7 ca        2020-06-07  6.03        6.74
##  8 ca        2020-06-08  6.42        6.75
##  9 ca        2020-06-09  7.54        6.99
## 10 ca        2020-06-10  7.27        7.19

The formula specified via slide_fun has access to all columns present in the original covidcast_signal data frame, and must refer to them with the prefix .x$. Here the function Mean() is a simple wrapper around mean() that omits NA values by default (provided by the modeltools package).

Notice that slide_by_geo() returns a data frame with a new column appended that contains the results (from sliding the formula), named “slide_value” by default. We can instead specify a name up front using the col_name argument:

slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7,
             col_name = "7dav") %>%
  select(geo_value, time_value, value, `7dav`) %>%
  arrange(geo_value) %>%
  head(10)
## # A tibble: 10 x 4
##    geo_value time_value value `7dav`
##    <chr>     <date>     <dbl>  <dbl>
##  1 ca        2020-06-01  6.34   6.34
##  2 ca        2020-06-02  5.86   6.10
##  3 ca        2020-06-03  5.88   6.03
##  4 ca        2020-06-04  7.97   6.51
##  5 ca        2020-06-05  8.40   6.89
##  6 ca        2020-06-06  6.71   6.86
##  7 ca        2020-06-07  6.03   6.74
##  8 ca        2020-06-08  6.42   6.75
##  9 ca        2020-06-09  7.54   6.99
## 10 ca        2020-06-10  7.27   7.19

As a simple sanity check, we compare the 7-day trailing average computed via slide_by_geo() to the values of the smoothed signal from the API:

slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7,
             col_name = "7dav") %>%
  full_join(signals[[2]], by = c("geo_value", "time_value")) %>%
  mutate(difference = `7dav` - value.y) %>%
  filter(time_value >= "2020-06-07") %>%
  summarize(max(abs(difference)))
## # A tibble: 1 x 1
##   `max(abs(difference))`
##                    <dbl>
## 1      0.000000000000718

Note that this check would fail before June 7 because the API access to data before June 1, whereas slide_by_geo() does not (when a trailing window of n days isn’t available, slide_by_geo() just applies the function or formula to whatever data is available).

Slide with a function

We can also pass a function for the slide_fun argument in slide_by_geo(). In this case, the passed function must have the following argument structure: x, a data frame the same column names as the original data frame; followed by any number of named additional arguments; and ending with ..., to capture general additional arguments. Recreating the last example of a 7-day trailing average:

slide_by_geo(signals[[1]], slide_fun = function(x, ...) Mean(x$value), n = 7,
             col_name = "7dav") %>%
  select(geo_value, time_value, value, `7dav`) %>%
  arrange(geo_value) %>%
  head(10)
## # A tibble: 10 x 4
##    geo_value time_value value `7dav`
##    <chr>     <date>     <dbl>  <dbl>
##  1 ca        2020-06-01  6.34   6.34
##  2 ca        2020-06-02  5.86   6.10
##  3 ca        2020-06-03  5.88   6.03
##  4 ca        2020-06-04  7.97   6.51
##  5 ca        2020-06-05  8.40   6.89
##  6 ca        2020-06-06  6.71   6.86
##  7 ca        2020-06-07  6.03   6.74
##  8 ca        2020-06-08  6.42   6.75
##  9 ca        2020-06-09  7.54   6.99
## 10 ca        2020-06-10  7.27   7.19

As a more sophisticated example, here we show how to sensorize the doctor visits signal. The sensor values are defined by the predicted values from a local (in time) regression of past case rates (response) on past doctor visits (covariate).

# Regression for doctor visits sensorization
dv_regression = function(x, m = 3, ...) {
  n = nrow(x)
  if (n <= m+1) return(NA) # Take care of trivial case

  return(tryCatch(suppressWarnings(suppressMessages({
    # Fit a regression, leaving out the last m days of data
    lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])

    # Form prediction for the last day of data, and return
    predict(lm_obj, newdata = x[n, ])
  })),
  error = function(e) return(NA)))
}

# Fetch doctor visits signal and join it to case rates
joined = suppressMessages(
  covidcast_signal(data_source = "doctor-visits",
                   signal = "smoothed_adj_cli",
                   start_day = start_day, end_day = end_day,
                   geo_type = "state", geo_values = geo_values)) %>%
  full_join(signals[[2]], by = c("geo_value", "time_value"))

# Perform sensorization for each state; use the last n = 56 days (8 weeks) of 
# data, minus the last m = 3 days, for the regression
slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3,
             col_name = "sensor") %>%
  select(geo_value, time_value, value.x, value.y, sensor) %>%
  arrange(geo_value) %>%
  head(10)
## # A tibble: 10 x 5
##    geo_value time_value value.x value.y sensor
##    <chr>     <date>       <dbl>   <dbl>  <dbl>
##  1 ca        2020-06-01    2.75    6.70  NA   
##  2 ca        2020-06-02    2.59    6.39  NA   
##  3 ca        2020-06-03    2.48    6.50  NA   
##  4 ca        2020-06-04    2.41    6.86  NA   
##  5 ca        2020-06-05    2.57    7.11   6.50
##  6 ca        2020-06-06    2.63    6.81   6.59
##  7 ca        2020-06-07    2.73    6.74   6.68
##  8 ca        2020-06-08    3.04    6.75   6.68
##  9 ca        2020-06-09    2.97    6.99   6.71
## 10 ca        2020-06-10    2.99    7.19   6.74

Above, the first 4 elements are NA because of insufficient training data (we need at least 2 training samples for simple linear regression, and we omit the last 3 days of data).

Returning complex objects

As a final example, we show how to use slide_by_geo() to output more complex objects. We amend the last example so that dv_regression() returns both the predicted value (sensor) and the fitted linear model object. In order to use this with slide_by_geo(), we need to set the col_type argument to be “list”:

dv_regression = function(x, m = 3, ...) {
  n = nrow(x)
  if (n <= m+1) return(list(lm_obj = NA, sensor = NA)) # Trivial case

  return(tryCatch(suppressWarnings(suppressMessages({
    # Fit a regression, leaving out the last days of data
    n = nrow(x)
    lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ])

    # Form prediction for the last day of data
    sensor = predict(lm_obj, newdata = x[n, ])

    # Return the fitted lm object and prediction
    list(lm_obj = lm_obj, sensor = sensor)
  })),
  error = function(e) return(list(lm_obj = NA, sensor = NA))))
}

joined <- slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3,
                       col_name = "sensor_obj", col_type = "list")

class(joined$sensor_obj)
## [1] "list"
names(joined$sensor_obj[[5]]) # The first 4 are lists filled with NAs
## [1] "lm_obj" "sensor"
joined$sensor_obj[[5]]$lm_obj
## 
## Call:
## lm(formula = value.y ~ value.x, data = x[1:(n - m + 1), ])
## 
## Coefficients:
## (Intercept)      value.x  
##      4.3362       0.8409
joined$sensor_obj[[5]]$sensor
##        1 
## 6.498118

This allows for post-inspection of the sensor models, which may be helpful for diagnostic purposes. Note the way in which we structure the return argument in dv_regression() in its failure cases: it is always a list with the same named arguments. This helps keep the post-inspection code just a bit more simple (it is already a hairy enough, to protect against extracting coefficients from NA objects).

joined %>%
  rowwise() %>%
  mutate(sensor = sensor_obj$sensor,
         intercept = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA,
                            coef(sensor_obj$lm_obj)[1]),
         slope = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA,
                        coef(sensor_obj$lm_obj)[2])) %>%
  select(geo_value, time_value, sensor, intercept, slope) %>%
  arrange(geo_value) %>%
  head(20)
## # A tibble: 20 x 5
## # Rowwise: 
##    geo_value time_value sensor intercept   slope
##    <chr>     <date>      <dbl>     <dbl>   <dbl>
##  1 ca        2020-06-01  NA        NA    NA     
##  2 ca        2020-06-02  NA        NA    NA     
##  3 ca        2020-06-03  NA        NA    NA     
##  4 ca        2020-06-04  NA        NA    NA     
##  5 ca        2020-06-05   6.50      4.34  0.841 
##  6 ca        2020-06-06   6.59      7.27 -0.256 
##  7 ca        2020-06-07   6.68      7.18 -0.181 
##  8 ca        2020-06-08   6.68      6.97 -0.0937
##  9 ca        2020-06-09   6.71      6.87 -0.0535
## 10 ca        2020-06-10   6.74      6.69  0.0159
## 11 ca        2020-06-11   6.83      6.18  0.218 
## 12 ca        2020-06-12   6.92      5.61  0.442 
## 13 ca        2020-06-13   7.01      5.12  0.633 
## 14 ca        2020-06-14   7.14      4.64  0.820 
## 15 ca        2020-06-15   7.50      4.07  1.04  
## 16 ca        2020-06-16   7.66      3.52  1.25  
## 17 ca        2020-06-17   7.79      3.18  1.38  
## 18 ca        2020-06-18   7.99      2.87  1.49  
## 19 ca        2020-06-19   8.43      2.25  1.72  
## 20 ca        2020-06-20   8.67      1.84  1.87