Slides a given function over variables in an epi_df object.
This is useful for computations like rolling averages. The function supports
many ways to specify the computation, but by far the most common use case is
as follows:
# Create new column `cases_7dmed` that contains a 7-day trailing median of cases
epi_slide(edf, cases_7dmed = median(cases), .window_size = 7)
For two very common use cases, we provide optimized functions that are much
faster than epi_slide: epi_slide_mean() and epi_slide_sum(). We recommend
using these functions when possible.
See vignette("epi_df") for more examples.
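To make the rolling computation concrete, here is a minimal base R sketch of a 7-day trailing median. It is not how epi_slide is implemented; it assumes a single location with one observation per day and no gaps, and the helper name trailing_median is hypothetical. The input is the first ten California case counts shown in the Examples.

```r
# Hypothetical helper (not part of epiprocess): trailing-window median over a
# vector that holds one observation per day with no missing days.
trailing_median <- function(values, window_size) {
  vapply(seq_along(values), function(i) {
    # The window ends at position i and reaches back at most
    # window_size - 1 positions; near the start it only covers existing data.
    window <- values[max(1, i - window_size + 1):i]
    median(window)
  }, numeric(1))
}

cases <- c(6, 4, 6, 11, 10, 18, 26, 19, 23, 22)
cases_7dmed <- trailing_median(cases, window_size = 7)
cases_7dmed
```

The first entries are computed from fewer than seven values, which mirrors the partial-window results visible in the first rows of the Examples output.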
Usage
epi_slide(
  .x,
  .f,
  ...,
  .window_size = NULL,
  .align = c("right", "center", "left"),
  .ref_time_values = NULL,
  .new_col_name = NULL,
  .all_rows = FALSE
)
Arguments
- .x
An epi_df object. If ungrouped, we temporarily group by geo_value and any columns in other_keys. If grouped, we make sure the grouping is by geo_value and other_keys.
- .f
Function, formula, or missing; together with ... specifies the computation to slide. The return of the computation should either be a scalar or a 1-row data frame. Data frame returns will be tidyr::unpack()-ed if no name is provided for them, and will be tidyr::pack()-ed into a single column if a name is provided. See examples.
If .f is missing, then ... will specify the computation via tidy-evaluation. This is usually the most convenient way to use epi_slide. See examples.
If .f is a formula, then the formula should use .x (not the same as the input epi_df) to operate on the columns of the input epi_df, e.g. ~mean(.x$var) to compute a mean of var.
If a function, .f must have the form function(x, g, t, ...), where:
  - x is a data frame with the same column names as the original object, minus any grouping variables, with only the windowed data for one group-.ref_time_value combination
  - g is a one-row tibble containing the values of the grouping variables for the associated group
  - t is the .ref_time_value for the current window
  - ... are additional arguments
- ...
Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then the ... is interpreted as a "data-masking" expression or expressions for tidy evaluation.
- .window_size
The size of the sliding window. The accepted values depend on the type of the time_value column in .x:
  - if time type is Date and the cadence is daily, then .window_size can be an integer (which will be interpreted in units of days) or a difftime with units "days"
  - if time type is Date and the cadence is weekly, then .window_size must be a difftime with units "weeks"
  - if time type is a yearmonth or an integer, then .window_size must be an integer
- .align
The alignment of the sliding window.
  - If "right" (default), then the window has its end at the reference time. This is likely the most common use case, e.g. .window_size=7 and .align="right" slides over the past week of data.
  - If "left", then the window has its start at the reference time.
  - If "center", then the window is centered at the reference time. If the window size is odd, then the window will have floor(window_size/2) points before and after the reference time; if the window size is even, then the window will be asymmetric and have one more value before the reference time than after.
- .ref_time_values
The time values at which to compute the slide values. By default, this is all the unique time values in .x.
- .new_col_name
Name for the new column that will contain the computed values. The default is "slide_value" unless your slide computations output data frames, in which case they will be unpacked (as in tidyr::unpack()) into the constituent columns and those names used. New columns should not be given names that clash with the existing columns of .x.
- .all_rows
If .all_rows = FALSE, the default, then the output epi_df will have only the rows that had a time_value in .ref_time_values. Otherwise, all the rows from .x are included, and the slide computation column(s) hold a missing value marker (typically NA, but more technically the result of vctrs::vec_cast-ing NA to the type of the slide computation output) for rows whose time_value is not in .ref_time_values.
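The .align rules above can be sketched in base R for daily data. This illustrates the documented window placement only and is not epi_slide's own code; the function window_bounds is hypothetical and assumes an integer .window_size interpreted in days.

```r
# Hypothetical helper (not part of epiprocess): first and last date covered by
# a window of `window_size` days for a given reference date and alignment.
window_bounds <- function(ref_time_value, window_size,
                          align = c("right", "center", "left")) {
  align <- match.arg(align)
  n_before <- switch(align,
    right = window_size - 1,         # window ends at the reference time
    left = 0,                        # window starts at the reference time
    center = floor(window_size / 2)  # even sizes put the extra point before
  )
  n_after <- window_size - 1 - n_before
  list(start = ref_time_value - n_before, end = ref_time_value + n_after)
}

ref <- as.Date("2020-03-10")
window_bounds(ref, 7, "right")   # the past week, ending at the reference date
window_bounds(ref, 7, "center")  # three days on each side
window_bounds(ref, 6, "center")  # three days before, two days after
```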
Value
An epi_df object with one or more new slide computation columns added. It will
be ungrouped if .x was ungrouped, and have the same groups as .x if .x was
grouped.
Details
Advanced uses of .f via tidy evaluation
If specifying .f via tidy evaluation, in addition to the standard .data and .env, we make some additional "pronoun"-like bindings available:
- .x, which is like .x in dplyr::group_modify; an ordinary object like an epi_df rather than an rlang pronoun like .data; this allows you to use additional dplyr, tidyr, and epiprocess operations. If you have multiple expressions in ..., this won't let you refer to the output of the earlier expressions, but .data will.
- .group_key, which is like .y in dplyr::group_modify.
- .ref_time_value, which is the element of .ref_time_values that determined the time window for the current computation.
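As a rough illustration of these bindings, the sketch below uses all three in one call. The toy data set edf and the output column names n_obs, window_ref, and window_geo are made up for this example, and the code assumes that the epiprocess and dplyr packages are installed.

```r
library(epiprocess)
library(dplyr)

# Toy data (made up for illustration): two locations, five consecutive days.
edf <- tibble(
  geo_value = rep(c("ca", "ny"), each = 5),
  time_value = rep(as.Date("2020-03-01") + 0:4, times = 2),
  cases = c(6, 4, 6, 11, 10, 1, 2, 3, 4, 5)
) %>%
  as_epi_df(as_of = as.Date("2020-04-01"))

out <- edf %>%
  epi_slide(
    n_obs = nrow(.x),                   # .x is the windowed data itself
    window_ref = .ref_time_value,       # reference time of the current window
    window_geo = .group_key$geo_value,  # one-row tibble of grouping values
    .window_size = 3
  )
out
```

Because .x is an ordinary data frame rather than a pronoun, functions such as nrow() can be applied to it directly.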
See also
epi_slide_opt for optimized slide functions
Examples
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# Get the 7-day trailing standard deviation of cases and the 7-day trailing mean of cases
cases_deaths_subset %>%
  epi_slide(
    cases_7sd = sd(cases, na.rm = TRUE),
    cases_7dav = mean(cases, na.rm = TRUE),
    .window_size = 7
  ) %>%
  select(geo_value, time_value, cases, cases_7sd, cases_7dav)
#> An `epi_df` object, 4,026 x 5 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 5
#> geo_value time_value cases cases_7sd cases_7dav
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA 6
#> 2 ca 2020-03-02 4 1.41 5
#> 3 ca 2020-03-03 6 1.15 5.33
#> 4 ca 2020-03-04 11 2.99 6.75
#> 5 ca 2020-03-05 10 2.97 7.4
#> 6 ca 2020-03-06 18 5.08 9.17
#> 7 ca 2020-03-07 26 7.87 11.6
#> 8 ca 2020-03-08 19 7.87 13.4
#> 9 ca 2020-03-09 23 7.34 16.1
#> 10 ca 2020-03-10 22 6.02 18.4
#> # ℹ 4,016 more rows
# Note that epi_slide_mean could be used to more quickly calculate cases_7dav.
# In addition to the [`dplyr::mutate`]-like syntax, you can feed in a function or
# formula in a way similar to [`dplyr::group_modify`]:
my_summarizer <- function(window_data) {
  window_data %>%
    summarize(
      cases_7sd = sd(cases, na.rm = TRUE),
      cases_7dav = mean(cases, na.rm = TRUE)
    )
}
cases_deaths_subset %>%
  epi_slide(
    ~ my_summarizer(.x),
    .window_size = 7
  ) %>%
  select(geo_value, time_value, cases, cases_7sd, cases_7dav)
#> An `epi_df` object, 4,026 x 5 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 5
#> geo_value time_value cases cases_7sd cases_7dav
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA 6
#> 2 ca 2020-03-02 4 1.41 5
#> 3 ca 2020-03-03 6 1.15 5.33
#> 4 ca 2020-03-04 11 2.99 6.75
#> 5 ca 2020-03-05 10 2.97 7.4
#> 6 ca 2020-03-06 18 5.08 9.17
#> 7 ca 2020-03-07 26 7.87 11.6
#> 8 ca 2020-03-08 19 7.87 13.4
#> 9 ca 2020-03-09 23 7.34 16.1
#> 10 ca 2020-03-10 22 6.02 18.4
#> # ℹ 4,016 more rows
#### Advanced: ####
# The tidyverse supports ["packing"][tidyr::pack] multiple columns into a
# single tibble-type column contained within some larger tibble. Like dplyr,
# we normally don't pack output columns together. However, packing behavior can be turned on
# by providing a name for a tibble-type output:
cases_deaths_subset %>%
  epi_slide(
    slide_packed = tibble(
      cases_7sd = sd(.x$cases, na.rm = TRUE),
      cases_7dav = mean(.x$cases, na.rm = TRUE)
    ),
    .window_size = 7
  ) %>%
  select(geo_value, time_value, cases, slide_packed)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> geo_value time_value cases slide_packed$cases_7sd $cases_7dav
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA 6
#> 2 ca 2020-03-02 4 1.41 5
#> 3 ca 2020-03-03 6 1.15 5.33
#> 4 ca 2020-03-04 11 2.99 6.75
#> 5 ca 2020-03-05 10 2.97 7.4
#> 6 ca 2020-03-06 18 5.08 9.17
#> 7 ca 2020-03-07 26 7.87 11.6
#> 8 ca 2020-03-08 19 7.87 13.4
#> 9 ca 2020-03-09 23 7.34 16.1
#> 10 ca 2020-03-10 22 6.02 18.4
#> # ℹ 4,016 more rows
cases_deaths_subset %>%
  epi_slide(
    ~ tibble(
      cases_7sd = sd(.x$cases, na.rm = TRUE),
      cases_7dav = mean(.x$cases, na.rm = TRUE)
    ),
    .new_col_name = "slide_packed",
    .window_size = 7
  ) %>%
  select(geo_value, time_value, cases, slide_packed)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> geo_value time_value cases slide_packed$cases_7sd $cases_7dav
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA 6
#> 2 ca 2020-03-02 4 1.41 5
#> 3 ca 2020-03-03 6 1.15 5.33
#> 4 ca 2020-03-04 11 2.99 6.75
#> 5 ca 2020-03-05 10 2.97 7.4
#> 6 ca 2020-03-06 18 5.08 9.17
#> 7 ca 2020-03-07 26 7.87 11.6
#> 8 ca 2020-03-08 19 7.87 13.4
#> 9 ca 2020-03-09 23 7.34 16.1
#> 10 ca 2020-03-10 22 6.02 18.4
#> # ℹ 4,016 more rows
# You can also get ["nested"][tidyr::nest] format by wrapping your results in
# a list:
cases_deaths_subset %>%
  group_by(geo_value) %>%
  epi_slide(
    function(x, g, t) {
      list(tibble(
        cases_7sd = sd(x$cases, na.rm = TRUE),
        cases_7dav = mean(x$cases, na.rm = TRUE)
      ))
    },
    .window_size = 7
  ) %>%
  ungroup() %>%
  select(geo_value, time_value, slide_value)
#> An `epi_df` object, 4,026 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 3
#> geo_value time_value slide_value
#> * <chr> <date> <list>
#> 1 ca 2020-03-01 <tibble [1 × 2]>
#> 2 ca 2020-03-02 <tibble [1 × 2]>
#> 3 ca 2020-03-03 <tibble [1 × 2]>
#> 4 ca 2020-03-04 <tibble [1 × 2]>
#> 5 ca 2020-03-05 <tibble [1 × 2]>
#> 6 ca 2020-03-06 <tibble [1 × 2]>
#> 7 ca 2020-03-07 <tibble [1 × 2]>
#> 8 ca 2020-03-08 <tibble [1 × 2]>
#> 9 ca 2020-03-09 <tibble [1 × 2]>
#> 10 ca 2020-03-10 <tibble [1 × 2]>
#> # ℹ 4,016 more rows
# Use the geo_value or the first time_value of the window in the slide computation
cases_deaths_subset %>%
  epi_slide(~ .x$geo_value[[1]], .window_size = 7)
#> An `epi_df` object, 4,026 x 7 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 7
#> geo_value time_value case_rate_7d_av death_rate_7d_av cases cases_7d_av
#> * <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 0.00327 0 6 1.29
#> 2 ca 2020-03-02 0.00435 0 4 1.71
#> 3 ca 2020-03-03 0.00617 0 6 2.43
#> 4 ca 2020-03-04 0.00980 0.000363 11 3.86
#> 5 ca 2020-03-05 0.0134 0.000363 10 5.29
#> 6 ca 2020-03-06 0.0200 0.000363 18 7.86
#> 7 ca 2020-03-07 0.0294 0.000363 26 11.6
#> 8 ca 2020-03-08 0.0341 0.000363 19 13.4
#> 9 ca 2020-03-09 0.0410 0.000726 23 16.1
#> 10 ca 2020-03-10 0.0468 0.000726 22 18.4
#> # ℹ 4,016 more rows
#> # ℹ 1 more variable: slide_value <chr>
cases_deaths_subset %>%
  epi_slide(~ .x$time_value[[1]], .window_size = 7)
#> An `epi_df` object, 4,026 x 7 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 7
#> geo_value time_value case_rate_7d_av death_rate_7d_av cases cases_7d_av
#> * <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 0.00327 0 6 1.29
#> 2 ca 2020-03-02 0.00435 0 4 1.71
#> 3 ca 2020-03-03 0.00617 0 6 2.43
#> 4 ca 2020-03-04 0.00980 0.000363 11 3.86
#> 5 ca 2020-03-05 0.0134 0.000363 10 5.29
#> 6 ca 2020-03-06 0.0200 0.000363 18 7.86
#> 7 ca 2020-03-07 0.0294 0.000363 26 11.6
#> 8 ca 2020-03-08 0.0341 0.000363 19 13.4
#> 9 ca 2020-03-09 0.0410 0.000726 23 16.1
#> 10 ca 2020-03-10 0.0468 0.000726 22 18.4
#> # ℹ 4,016 more rows
#> # ℹ 1 more variable: slide_value <date>