Slides a given function over variables in an epi_df object. See the slide vignette for examples.

Usage

epi_slide(
  .x,
  .f,
  ...,
  .window_size = NULL,
  .align = c("right", "center", "left"),
  .ref_time_values = NULL,
  .new_col_name = NULL,
  .all_rows = FALSE
)

Arguments

.x

The epi_df object under consideration, grouped or ungrouped. If ungrouped, all data in .x will be treated as part of a single data group.

.f

Function, formula, or missing; together with ... specifies the computation to slide. To "slide" means to apply a computation within a sliding (a.k.a. "rolling") time window for each data group. The window is determined by the .window_size and .align parameters; see the Details section for more. If a function, .f must have the form function(x, g, t, ...), where

  • x is a data frame with the same column names as the original object, minus any grouping variables, with only the windowed data for one group-.ref_time_value combination

  • g is a one-row tibble containing the values of the grouping variables for the associated group

  • t is the .ref_time_value for the current window

  • ... are additional arguments

If a formula, .f can operate directly on columns accessed via .x$var or .$var, as in ~mean(.x$var) to compute a mean of a column var for each ref_time_value-group combination. The group key can be accessed via .y. If .f is missing, then ... will specify the computation.
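As a minimal sketch of the function form, the following defines a slide computation with the signature function(x, g, t, ...) and calls it directly on a hand-built window to show what each argument holds. The helper name window_mean, its col argument, and the toy data are illustrative assumptions, not part of epiprocess. The guarded call at the end assumes the version of the package documented here is installed.

```r
# Hypothetical slide computation in the documented form function(x, g, t, ...).
# `col` is an illustrative extra argument of the kind that `...` would forward.
window_mean <- function(x, g, t, col = "cases") {
  mean(x[[col]])
}

# Hand-built stand-ins for what a single window would receive:
x <- data.frame(time_value = as.Date("2020-03-01") + 0:2, cases = c(6, 4, 6))
g <- data.frame(geo_value = "ca")  # one-row group key
t <- as.Date("2020-03-03")         # reference time of this window

result <- window_mean(x, g, t)

# Passing the same function to epi_slide(); runs only if the packages exist.
if (requireNamespace("epiprocess", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  edf <- epiprocess::as_epi_df(tibble::tibble(
    geo_value = "ca",
    time_value = as.Date("2020-03-01") + 0:4,
    cases = c(6, 4, 6, 11, 10)
  ))
  out <- epiprocess::epi_slide(
    dplyr::group_by(edf, geo_value),
    window_mean,
    .window_size = 3,
    .new_col_name = "cases_3dav"
  )
}
```

Calling the function by hand like this is also a convenient way to debug a slide computation before handing it to epi_slide().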

...

Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then the ... is interpreted as a "data-masking" expression or expressions for tidy evaluation; in addition to referring to columns directly by name, the expressions have access to .data and .env pronouns as in dplyr verbs, and can also refer to .x (not the same as the input epi_df), .group_key, and .ref_time_value. See details.

.window_size

The size of the sliding window. By default, this is 1, meaning that only the current ref_time_value is included. The accepted values here depend on the time_value column:

  • if time_type is Date and the cadence is daily, then .window_size can be an integer (which will be interpreted in units of days) or a difftime with units "days"

  • if time_type is Date and the cadence is weekly, then .window_size must be a difftime with units "weeks"

  • if time_type is an integer, then .window_size must be an integer
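The accepted forms above can be built with base R alone. This is a sketch of how such values might be constructed; the variable names are illustrative, and nothing here calls epiprocess.

```r
# Daily Date data: a whole number or a "days" difftime both describe 7 days.
size_daily_int <- 7L
size_daily_dt  <- as.difftime(7, units = "days")

# Weekly Date data: the size must be a difftime in weeks (here, 4 weeks).
size_weekly <- as.difftime(4, units = "weeks")

# Integer time_value: a plain integer count of time steps.
size_integer <- 3L

units(size_daily_dt)  # "days"
units(size_weekly)    # "weeks"
```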

.align

The alignment of the sliding window. If right (default), then the window has its end at the reference time; if center, then the window is centered at the reference time; if left, then the window has its start at the reference time. If the alignment is center and the window size is odd, then the window will have floor(window_size/2) points before and after the reference time. If the window size is even, then the window will be asymmetric and have one less value on the right side of the reference time (assuming time increases from left to right).

.ref_time_values

Time values for sliding computations; each element of this vector serves as the reference time point for one sliding window. If missing, this defaults to all unique time values in the underlying data table.

.new_col_name

String indicating the name of the new column that will contain the computed values. The default is "slide_value" unless your slide computations output data frames, in which case they will be unpacked into the constituent columns and those names used. New columns should not be given names that clash with the existing columns of .x; see details.

.all_rows

If .all_rows = TRUE, then all rows of .x will be kept in the output even with .ref_time_values provided, with a missing value marker in the slide computation output column(s) for time_values outside .ref_time_values; otherwise, there will be one row for each row in .x that had a time_value in .ref_time_values. Default is FALSE. The missing value marker is the result of casting NA to the type of the slide computation output with vctrs::vec_cast().
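To see what a missing value marker of this kind looks like, the sketch below casts NA to two common output types with vctrs::vec_cast(). It only illustrates the cast itself; how epi_slide() applies it internally is not shown here, and the variable names are illustrative.

```r
# Runs only if vctrs is installed.
if (requireNamespace("vctrs", quietly = TRUE)) {
  # A bare NA is treated as "unspecified" and can be cast to other types.
  na_dbl <- vctrs::vec_cast(NA, to = double())     # NA_real_
  na_chr <- vctrs::vec_cast(NA, to = character())  # NA_character_
  str(na_dbl)
  str(na_chr)
}
```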

Value

An epi_df object given by appending one or more new columns to .x, named according to the .new_col_name argument.

Details

To "slide" means to apply a function or formula over a rolling window. The .window_size arg determines the width of the window (including the reference time) and the .align arg governs how the window is aligned (see below for examples). The .ref_time_values arg controls which time values to consider for the slide, and .all_rows controls whether rows outside .ref_time_values are kept in the output, filled with missing values.

epi_slide() does not require a complete window (such as on the left boundary of the dataset) and will attempt to perform the computation anyway. The issue of what to do with partial computations (those run on incomplete windows) is therefore left up to the user, either through the specified function or formula, or through post-processing.
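One way to handle partial windows inside the slide computation is to return a missing value when too few rows are present. The sketch below uses a hypothetical helper, mean_if_complete, and assumes that an incomplete window reaches the function as a data frame with fewer than n rows; that assumption is worth checking against your installed version.

```r
# Hypothetical slide computation: return NA unless the window holds `n` rows.
mean_if_complete <- function(x, g, t, n = 7) {
  if (nrow(x) < n) NA_real_ else mean(x$cases)
}

# Toy windows to exercise the function directly, without epi_slide():
full    <- data.frame(cases = c(6, 4, 6, 11, 10, 18, 26))
partial <- data.frame(cases = c(6, 4, 6))

mean_if_complete(full, NULL, NULL)     # mean of all 7 values
mean_if_complete(partial, NULL, NULL)  # NA, since only 3 of 7 rows are present
```

A function like this could then be passed as .f with .window_size = 7; the alternative is to compute on every window and filter afterwards.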

Let's look at some window examples, assuming that the reference time value is "tv". With .align = "right" and .window_size = 3, the window will be:

time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
window:              tv - 2, tv - 1, tv

With .align = "center" and .window_size = 3, the window will be:

time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
window:                      tv - 1, tv, tv + 1

With .align = "center" and .window_size = 4, the window will be:

time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
window:              tv - 2, tv - 1, tv, tv + 1

With .align = "left" and .window_size = 3, the window will be:

time_values: tv - 3, tv - 2, tv - 1, tv, tv + 1, tv + 2, tv + 3
window:                              tv, tv + 1, tv + 2
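The alignment rules in these examples can be reproduced with a few lines of base R. The helper window_offsets below is a hypothetical illustration, not an epiprocess function; it returns the offsets from tv that fall inside the window, following the rule that an even-sized centered window has one fewer value on the right.

```r
# Hypothetical helper: offsets (relative to tv) covered by a window.
window_offsets <- function(size, align = c("right", "center", "left")) {
  align <- match.arg(align)
  # Number of time steps before the reference time:
  before <- switch(align,
    right  = size - 1,
    center = floor(size / 2),
    left   = 0
  )
  # The rest of the window (excluding tv itself) lies after it.
  after <- size - 1 - before
  seq(-before, after)
}

window_offsets(3, "right")   # -2 -1  0
window_offsets(3, "center")  # -1  0  1
window_offsets(4, "center")  # -2 -1  0  1
window_offsets(3, "left")    #  0  1  2
```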

If .f is missing, then "data-masking" expression(s) for tidy evaluation can be specified, for example, as in:

epi_slide(x, cases_7dav = mean(cases), .window_size = 7)

which would be equivalent to:

epi_slide(x, function(x, g, t) mean(x$cases), .window_size = 7,
          .new_col_name = "cases_7dav")

In a manner similar to dplyr::mutate:

  • Expressions evaluating to length-1 vectors will be recycled to appropriate lengths.

  • , name_var := value can be used to set the output column name based on a variable name_var rather than requiring you to use a hard-coded name. (The leading comma is needed to make sure that .f is treated as missing.)

  • = NULL can be used to remove results from previous expressions (though we don't allow it to remove pre-existing columns).

  • , fn_returning_a_data_frame(.x) will unpack the output of the function into multiple columns in the result.

  • Named expressions evaluating to data frames will be placed into tidyr::packed columns.

In addition to .data and .env, we make some additional "pronoun"-like bindings available:

  • .x, which is like .x in dplyr::group_modify; an ordinary object like an epi_df rather than an rlang pronoun like .data; this allows you to use additional dplyr, tidyr, and epiprocess operations. If you have multiple expressions in ..., this won't let you refer to the output of the earlier expressions, but .data will.

  • .group_key, which is like .y in dplyr::group_modify.

  • .ref_time_value, which is the element of .ref_time_values that determined the time window for the current computation.
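As a hedged sketch of these bindings in use, the call below refers to .x, .ref_time_value, and .group_key inside data-masking expressions. The toy data and the output column names (n_rows, window_ref, geo) are illustrative assumptions, and the block runs only if epiprocess and dplyr are installed. No output is shown because the exact values may depend on the package version.

```r
if (requireNamespace("epiprocess", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)) {
  # Small illustrative epi_df with one geo and five daily observations.
  edf <- epiprocess::as_epi_df(tibble::tibble(
    geo_value = "ca",
    time_value = as.Date("2020-03-01") + 0:4,
    cases = c(6, 4, 6, 11, 10)
  ))

  out <- epiprocess::epi_slide(
    dplyr::group_by(edf, geo_value),
    n_rows = nrow(.x),               # .x: the windowed data as an ordinary object
    window_ref = .ref_time_value,    # reference time of the current window
    geo = .group_key$geo_value,      # grouping values for the current group
    .window_size = 3
  )
}
```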

Examples

# slide a 7-day trailing average formula on cases
# Simple sliding means and sums are much faster to do using
# the `epi_slide_mean` and `epi_slide_sum` functions instead.
jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(cases_7dav = mean(cases), .window_size = 7) %>%
  dplyr::select(geo_value, time_value, cases, cases_7dav) %>%
  ungroup()
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-23 02:40:48.296938
#> 
#> # A tibble: 4,026 × 4
#>    geo_value time_value cases cases_7dav
#>  * <chr>     <date>     <dbl>      <dbl>
#>  1 ca        2020-03-01     6       NA  
#>  2 ca        2020-03-02     4       NA  
#>  3 ca        2020-03-03     6       NA  
#>  4 ca        2020-03-04    11       NA  
#>  5 ca        2020-03-05    10       NA  
#>  6 ca        2020-03-06    18       NA  
#>  7 ca        2020-03-07    26       11.6
#>  8 ca        2020-03-08    19       13.4
#>  9 ca        2020-03-09    23       16.1
#> 10 ca        2020-03-10    22       18.4
#> # ℹ 4,016 more rows

# slide a 7-day leading average
jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(cases_7dav = mean(cases), .window_size = 7, .align = "left") %>%
  dplyr::select(geo_value, time_value, cases, cases_7dav) %>%
  ungroup()
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-23 02:40:48.296938
#> 
#> # A tibble: 4,026 × 4
#>    geo_value time_value cases cases_7dav
#>  * <chr>     <date>     <dbl>      <dbl>
#>  1 ca        2020-03-01     6       11.6
#>  2 ca        2020-03-02     4       13.4
#>  3 ca        2020-03-03     6       16.1
#>  4 ca        2020-03-04    11       18.4
#>  5 ca        2020-03-05    10       20.4
#>  6 ca        2020-03-06    18       25.1
#>  7 ca        2020-03-07    26       30.1
#>  8 ca        2020-03-08    19       34.4
#>  9 ca        2020-03-09    23       37.3
#> 10 ca        2020-03-10    22       56.7
#> # ℹ 4,016 more rows

# slide a 7-day center-aligned average
jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(cases_7dav = mean(cases), .window_size = 7, .align = "center") %>%
  dplyr::select(geo_value, time_value, cases, cases_7dav) %>%
  ungroup()
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-23 02:40:48.296938
#> 
#> # A tibble: 4,026 × 4
#>    geo_value time_value cases cases_7dav
#>  * <chr>     <date>     <dbl>      <dbl>
#>  1 ca        2020-03-01     6       NA  
#>  2 ca        2020-03-02     4       NA  
#>  3 ca        2020-03-03     6       NA  
#>  4 ca        2020-03-04    11       11.6
#>  5 ca        2020-03-05    10       13.4
#>  6 ca        2020-03-06    18       16.1
#>  7 ca        2020-03-07    26       18.4
#>  8 ca        2020-03-08    19       20.4
#>  9 ca        2020-03-09    23       25.1
#> 10 ca        2020-03-10    22       30.1
#> # ℹ 4,016 more rows

# slide a 14-day center-aligned average
jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(cases_14dav = mean(cases), .window_size = 14, .align = "center") %>%
  dplyr::select(geo_value, time_value, cases, cases_14dav) %>%
  ungroup()
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-23 02:40:48.296938
#> 
#> # A tibble: 4,026 × 4
#>    geo_value time_value cases cases_14dav
#>  * <chr>     <date>     <dbl>       <dbl>
#>  1 ca        2020-03-01     6        NA  
#>  2 ca        2020-03-02     4        NA  
#>  3 ca        2020-03-03     6        NA  
#>  4 ca        2020-03-04    11        NA  
#>  5 ca        2020-03-05    10        NA  
#>  6 ca        2020-03-06    18        NA  
#>  7 ca        2020-03-07    26        NA  
#>  8 ca        2020-03-08    19        23  
#>  9 ca        2020-03-09    23        25.4
#> 10 ca        2020-03-10    22        36.4
#> # ℹ 4,016 more rows

# nested new columns
jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(
    cases_2d = list(data.frame(
      cases_2dav = mean(cases),
      cases_2dma = mad(cases)
    )),
    .window_size = 2
  ) %>%
  ungroup()
#> An `epi_df` object, 4,026 x 7 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-23 02:40:48.296938
#> 
#> # A tibble: 4,026 × 7
#>    geo_value time_value cases cases_7d_av case_rate_7d_av death_rate_7d_av
#>  * <chr>     <date>     <dbl>       <dbl>           <dbl>            <dbl>
#>  1 ca        2020-03-01     6        1.29         0.00327         0       
#>  2 ca        2020-03-02     4        1.71         0.00435         0       
#>  3 ca        2020-03-03     6        2.43         0.00617         0       
#>  4 ca        2020-03-04    11        3.86         0.00980         0.000363
#>  5 ca        2020-03-05    10        5.29         0.0134          0.000363
#>  6 ca        2020-03-06    18        7.86         0.0200          0.000363
#>  7 ca        2020-03-07    26       11.6          0.0294          0.000363
#>  8 ca        2020-03-08    19       13.4          0.0341          0.000363
#>  9 ca        2020-03-09    23       16.1          0.0410          0.000726
#> 10 ca        2020-03-10    22       18.4          0.0468          0.000726
#> # ℹ 4,016 more rows
#> # ℹ 1 more variable: cases_2d <list>