Slides a given function over variables in an epi_archive object. This behaves similarly to epi_slide(), with the key exception that it is version-aware: the sliding computation at any given reference time t is performed on data that would have been available as of t. See the archive vignette for examples.

Usage

epix_slide(
  x,
  f,
  ...,
  before = Inf,
  ref_time_values = NULL,
  new_col_name = "slide_value",
  as_list_col = FALSE,
  names_sep = "_",
  all_versions = FALSE
)

# S3 method for class 'epi_archive'
epix_slide(
  x,
  f,
  ...,
  before = Inf,
  ref_time_values = NULL,
  new_col_name = "slide_value",
  as_list_col = FALSE,
  names_sep = "_",
  all_versions = FALSE
)

# S3 method for class 'grouped_epi_archive'
epix_slide(
  x,
  f,
  ...,
  before = Inf,
  ref_time_values = NULL,
  new_col_name = "slide_value",
  as_list_col = FALSE,
  names_sep = "_",
  all_versions = FALSE
)

Arguments

x

An epi_archive or grouped_epi_archive object. If ungrouped, all data in x will be treated as part of a single data group.

f

Function, formula, or missing; together with ... specifies the computation to slide. To "slide" means to apply a computation over a sliding (a.k.a. "rolling") time window for each data group. The window is determined by the before parameter described below. One time step is typically one day or one week; see epi_slide details for more explanation. If a function, f must take an epi_df with the same column names as the archive's DT, minus the version column; followed by a one-row tibble containing the values of the grouping variables for the associated group; followed by a reference time value, usually as a Date object; followed by any number of named arguments. If a formula, f can operate directly on columns accessed via .x$var or .$var, as in ~ mean(.x$var) to compute a mean of a column var for each group-ref_time_value combination. The group key can be accessed via .y or .group_key, and the reference time value can be accessed via .z or .ref_time_value. If f is missing, then ... will specify the computation.

...

Additional arguments to pass to the function or formula specified via f. Alternatively, if f is missing, then ... is interpreted as an expression for tidy evaluation; in addition to referring to columns directly by name, the expression has access to .data and .env pronouns as in dplyr verbs, and can also refer to the .group_key and .ref_time_value. See details of epi_slide.

before

How far before each ref_time_value should the sliding window extend? If provided, should be a single, non-NA, integer-compatible number of time steps. This window endpoint is inclusive. For example, if before = 7, and one time step is one day, then to produce a value for a ref_time_value of January 8, we apply the given function or formula to data (for each group present) with time_values from January 1 onward, as they were reported on January 8. For typical disease surveillance sources, this will not include any data with a time_value of January 8, and, depending on the amount of reporting latency, may not include January 7 or even earlier time_values. (If the archive were instead to hold nowcasts rather than regular surveillance data, then we would indeed expect data for time_value January 8. If it were to hold forecasts, then we would expect data for time_values after January 8, and the sliding window would extend as far after each ref_time_value as needed to include all such time_values.)
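The January example above can be mimicked in a few lines of base R. This is only a minimal sketch of the idea, not the package's implementation: the toy data and the helper window_as_of() are hypothetical, and a plain data frame stands in for an epi_archive.

```r
# Toy archive (hypothetical data): one row per (time_value, version) update.
toy_archive <- data.frame(
  time_value = as.Date(c("2020-01-01", "2020-01-01", "2020-01-05",
                         "2020-01-06", "2020-01-07")),
  version    = as.Date(c("2020-01-02", "2020-01-09", "2020-01-06",
                         "2020-01-08", "2020-01-09")),
  value      = c(10, 12, 20, 30, 40)
)

# Hypothetical helper: the data as known at `ref_time_value`, limited to
# time_values no earlier than `ref_time_value - before` (inclusive).
window_as_of <- function(archive, ref_time_value, before) {
  # Keep only updates that had been published by the reference time.
  known <- archive[archive$version <= ref_time_value, , drop = FALSE]
  # For each time_value, keep the most recent of those updates.
  known <- known[order(known$time_value, known$version), , drop = FALSE]
  latest <- known[!duplicated(known$time_value, fromLast = TRUE), , drop = FALSE]
  # Apply the inclusive lower endpoint of the window.
  keep <- latest$time_value >= ref_time_value - before
  latest[keep, c("time_value", "value"), drop = FALSE]
}

win <- window_as_of(toy_archive, as.Date("2020-01-08"), before = 7)
print(win)
# Only time_values 2020-01-01, 2020-01-05 and 2020-01-06 remain: the update
# for January 7 was published on January 9, after the reference time, and the
# January 9 revision of the January 1 value is likewise ignored.
```

In this toy data the window for January 8 holds nothing for January 7 or 8, which is the latency effect described above.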

ref_time_values

Reference time values / versions for sliding computations; each element of this vector serves both as the anchor point for the time_value window for the computation and as the max_version passed to epix_as_of() when fetching data for this window. If missing, then this will be set to a regularly-spaced sequence of values covering the range of versions in the DT plus the versions_end; the spacing of values will be guessed (using the GCD of the skips between values).
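One plausible reading of the "GCD of the skips" guess is sketched below in base R. The helpers gcd2() and guess_ref_time_values() are hypothetical names introduced for illustration; the sketch assumes daily Date versions and at least two distinct values, and the package's actual logic may differ.

```r
# Euclid's algorithm for two non-negative integers.
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical helper: guess a regular grid covering the observed versions
# plus versions_end.
guess_ref_time_values <- function(versions, versions_end) {
  pts <- sort(unique(c(versions, versions_end)))
  skips <- as.integer(diff(pts))  # gaps between consecutive versions, in days
  step <- Reduce(gcd2, skips)     # largest step that lands on every version
  seq(min(pts), max(pts), by = step)
}

versions <- as.Date(c("2020-06-01", "2020-06-03", "2020-06-09"))
grid <- guess_ref_time_values(versions, versions_end = as.Date("2020-06-15"))
print(grid)
# The skips are 2, 6 and 6 days, so the guessed spacing is 2 days.
```

Using the GCD means every observed version falls on the grid, at the cost of also producing reference times at which no new version was published.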

new_col_name

String indicating the name of the new column that will contain the derivative values. Default is "slide_value"; note that setting new_col_name equal to an existing column name will overwrite this column.

as_list_col

Should the slide results be held in a list column, or be unchopped/unnested? Default is FALSE, in which case a list object returned by f would be unnested (using tidyr::unnest()), and, if the slide computations output data frames, the names of the resulting columns are given by prepending new_col_name to the names of the list elements.

names_sep

String specifying the separator to use in tidyr::unnest() when as_list_col = FALSE. Default is "_". Using NULL drops the prefix from new_col_name entirely.
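The effect of names_sep can be seen by calling tidyr::unnest() directly on a small list column. The tibble below is hypothetical and only stands in for slide results; it is a sketch of the naming behavior, not of epix_slide() itself.

```r
library(tibble)
library(tidyr)

# Hypothetical slide results: one data frame per reference time, held in a
# list column named like the default new_col_name.
res <- tibble(
  time_value = as.Date(c("2020-06-02", "2020-06-03")),
  slide_value = list(
    tibble(n = 1L, avg = 6.6),
    tibble(n = 2L, avg = 6.5)
  )
)

# With a separator, the outer column name is prepended to each inner name.
prefixed <- unnest(res, slide_value, names_sep = "_")
names(prefixed)  # "time_value" "slide_value_n" "slide_value_avg"

# With names_sep = NULL, the inner names are used as-is.
bare <- unnest(res, slide_value, names_sep = NULL)
names(bare)      # "time_value" "n" "avg"
```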

all_versions

(Not the same as all_rows parameter of epi_slide.) If all_versions = TRUE, then f will be passed the version history (all versions <= ref_time_value) for rows having time_value between ref_time_value - before and ref_time_value. Otherwise, f will be passed only the most recent version for every unique time_value. Default is FALSE.
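The difference between the two settings can be illustrated on a toy data frame in base R. This is a hedged sketch with hypothetical data; with all_versions = TRUE the real function passes an epi_archive rather than a plain data frame.

```r
# Toy archive (hypothetical data): some time_values have been revised.
toy_archive <- data.frame(
  time_value = as.Date(c("2020-06-05", "2020-06-05", "2020-06-06",
                         "2020-06-06", "2020-06-07")),
  version    = as.Date(c("2020-06-06", "2020-06-08", "2020-06-07",
                         "2020-06-10", "2020-06-08")),
  value      = c(1.0, 1.5, 2.0, 2.5, 3.0)
)
ref_time_value <- as.Date("2020-06-08")
before <- 5

# Rows published by the reference time whose time_value lies in the window.
in_scope <- toy_archive$version <= ref_time_value &
  toy_archive$time_value >= ref_time_value - before &
  toy_archive$time_value <= ref_time_value

# Analogue of all_versions = TRUE: every update up to the reference time.
history <- toy_archive[in_scope, ]
history <- history[order(history$time_value, history$version), ]

# Analogue of all_versions = FALSE: only the latest update per time_value.
latest <- history[!duplicated(history$time_value, fromLast = TRUE), ]

nrow(history)  # 4 rows: includes both versions of 2020-06-05
nrow(latest)   # 3 rows: one per time_value
```

This is why a computation given the version history can see more rows than there are time_values in its window.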

Value

A tibble whose columns are: the grouping variables; time_value, containing the reference time values for the slide computation; and a column named according to the new_col_name argument, containing the slide values.

Details

A few key distinctions between the current function and epi_slide():

  1. In f functions for epix_slide, one should not assume that the input data contain any rows with time_value matching the computation's ref_time_value (accessible via attributes(<data>)$metadata$as_of); for typical epidemiological surveillance data, observations pertaining to a particular time period (time_value) are first reported as_of some instant after that time period has ended.

  2. epix_slide() doesn't accept an after argument; its windows extend from before time steps before a given ref_time_value through the last time_value available as of version ref_time_value (typically, this won't include ref_time_value itself, as observations about a particular time interval (e.g., day) are only published after that time interval ends); epi_slide windows extend from before time steps before a ref_time_value through after time steps after ref_time_value.

  3. The input class and columns are similar but different: epix_slide (with the default all_versions=FALSE) keeps all columns and the epi_df-ness of the first argument to each computation; epi_slide only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the essential geo_value column. (With all_versions=TRUE, epix_slide will provide an epi_archive rather than an epi_df to each computation.)

  4. The output class and columns are similar but different: epix_slide() returns a tibble containing only the grouping variables, time_value, and the new column(s) from the slide computations, whereas epi_slide() returns an epi_df with all original variables plus the new columns from the slide computations. (Both will mirror the grouping or ungroupedness of their input, with one exception: epi_archives can have trivial (zero-variable) groupings, but these will be dropped in epix_slide results as they are not supported by tibbles.)

  5. There are no size stability checks or element/row recycling to maintain size stability in epix_slide, unlike in epi_slide. (epix_slide is roughly analogous to dplyr::group_modify, while epi_slide is roughly analogous to dplyr::mutate followed by dplyr::arrange.) This is detailed in the "advanced" vignette.

  6. all_rows is not supported in epix_slide; since the slide computations are allowed more flexibility in their outputs than in epi_slide, we can't guess a good representation for missing computations for excluded group-ref_time_value pairs.

  7. The ref_time_values default for epix_slide is based on making an evenly-spaced sequence out of the versions in the DT plus the versions_end, rather than the time_values.

Apart from the above distinctions, the interfaces between epix_slide() and epi_slide() are the same.

Furthermore, the current function can be considerably slower than epi_slide(), for two reasons: (1) it must repeatedly fetch properly-versioned snapshots from the data archive (via epix_as_of()), and (2) it performs a "manual" sliding of sorts, and does not benefit from the highly efficient slider package. For this reason, it should not be used as a substitute for epi_slide(); use it only when version-aware sliding is necessary (as is its purpose).

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

# Reference time points for which we want to compute slide values:
ref_time_values <- seq(as.Date("2020-06-01"),
  as.Date("2020-06-15"),
  by = "1 day"
)

# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    f = ~ mean(.x$case_rate_7d_av),
    before = 2,
    ref_time_values = ref_time_values,
    new_col_name = "case_rate_7d_av_recent_av"
  ) %>%
  ungroup()
#> # A tibble: 57 × 3
#>    geo_value time_value case_rate_7d_av_recent_av
#>    <chr>     <date>                         <dbl>
#>  1 NA        2020-06-01                    NaN   
#>  2 ca        2020-06-02                      6.63
#>  3 fl        2020-06-02                      3.38
#>  4 ny        2020-06-02                      6.57
#>  5 tx        2020-06-02                      4.52
#>  6 ca        2020-06-03                      6.54
#>  7 fl        2020-06-03                      3.42
#>  8 ny        2020-06-03                      6.66
#>  9 tx        2020-06-03                      4.75
#> 10 ca        2020-06-04                      6.53
#> # ℹ 47 more rows
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example).  In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#                                                discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == ref_time_value` as of `ref_time_value`.
# The example below shows this type of behavior in more detail.

# Examining characteristics of the data passed to each computation with
# `all_versions=FALSE`.
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = FALSE,
    ref_time_values = ref_time_values, names_sep = NULL
  ) %>%
  ungroup() %>%
  arrange(geo_value, time_value)
#> # A tibble: 57 × 5
#>    geo_value time_value time_range                   n class1
#>    <chr>     <date>     <chr>                    <int> <chr> 
#>  1 ca        2020-06-02 2020-06-01 -- 2020-06-01     1 epi_df
#>  2 ca        2020-06-03 2020-06-01 -- 2020-06-02     2 epi_df
#>  3 ca        2020-06-04 2020-06-01 -- 2020-06-03     3 epi_df
#>  4 ca        2020-06-05 2020-06-01 -- 2020-06-04     4 epi_df
#>  5 ca        2020-06-06 2020-06-01 -- 2020-06-05     5 epi_df
#>  6 ca        2020-06-07 2020-06-02 -- 2020-06-06     5 epi_df
#>  7 ca        2020-06-08 2020-06-03 -- 2020-06-07     5 epi_df
#>  8 ca        2020-06-09 2020-06-04 -- 2020-06-08     5 epi_df
#>  9 ca        2020-06-10 2020-06-05 -- 2020-06-09     5 epi_df
#> 10 ca        2020-06-11 2020-06-06 -- 2020-06-10     5 epi_df
#> # ℹ 47 more rows

# --- Advanced: ---

# `epix_slide` with `all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:

archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    before = 5, all_versions = TRUE,
    ref_time_values = ref_time_values, names_sep = NULL
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)
#> # A tibble: 14 × 6
#>    time_value versions_start versions_end time_range                   n class1 
#>    <date>     <chr>          <date>       <chr>                    <int> <chr>  
#>  1 2020-06-02 2020-06-02     2020-06-02   2020-06-01 -- 2020-06-01     1 epi_ar…
#>  2 2020-06-03 2020-06-02     2020-06-03   2020-06-01 -- 2020-06-02     2 epi_ar…
#>  3 2020-06-04 2020-06-02     2020-06-04   2020-06-01 -- 2020-06-03     3 epi_ar…
#>  4 2020-06-05 2020-06-02     2020-06-05   2020-06-01 -- 2020-06-04     4 epi_ar…
#>  5 2020-06-06 2020-06-02     2020-06-06   2020-06-01 -- 2020-06-05     8 epi_ar…
#>  6 2020-06-07 2020-06-03     2020-06-07   2020-06-02 -- 2020-06-06     9 epi_ar…
#>  7 2020-06-08 2020-06-04     2020-06-08   2020-06-03 -- 2020-06-07     9 epi_ar…
#>  8 2020-06-09 2020-06-05     2020-06-09   2020-06-04 -- 2020-06-08     8 epi_ar…
#>  9 2020-06-10 2020-06-06     2020-06-10   2020-06-05 -- 2020-06-09     8 epi_ar…
#> 10 2020-06-11 2020-06-07     2020-06-11   2020-06-06 -- 2020-06-10     8 epi_ar…
#> 11 2020-06-12 2020-06-08     2020-06-12   2020-06-07 -- 2020-06-11     8 epi_ar…
#> 12 2020-06-13 2020-06-09     2020-06-13   2020-06-08 -- 2020-06-12     8 epi_ar…
#> 13 2020-06-14 2020-06-10     2020-06-14   2020-06-09 -- 2020-06-13     8 epi_ar…
#> 14 2020-06-15 2020-06-11     2020-06-15   2020-06-10 -- 2020-06-14     8 epi_ar…