Slide a function over variables in an epi_archive or grouped_epi_archive
Source: R/methods-epi_archive.R, R/grouped_epi_archive.R
epix_slide.Rd
Slides a given function over variables in an epi_archive object. This
behaves similarly to epi_slide(), with the key exception that it is
version-aware: the sliding computation at any given reference time t is
performed on data that would have been available as of t. This function
is intended for use in accurate backtesting of models; see
vignette("backtesting", package="epipredict") for a walkthrough.
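The idea of a version-aware window can be sketched in base R without the package. The following is only an illustration of the concept, not how epix_slide() is implemented: the updates data frame, its values, and the snapshot_as_of() helper are all hypothetical.

```r
# Toy update log: each row is one reported (or revised) value.
# All names and numbers here are made up for illustration.
updates <- data.frame(
  time_value = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02", "2020-06-03")),
  version    = as.Date(c("2020-06-02", "2020-06-04", "2020-06-03", "2020-06-04")),
  value      = c(10, 12, 20, 30)
)

# Hypothetical helper: the data as it would have looked on `ref_time`,
# restricted to time_values no earlier than `ref_time - before`.
snapshot_as_of <- function(updates, ref_time, before = Inf) {
  # Drop anything that had not yet been reported as of `ref_time`.
  known <- updates[updates$version <= ref_time, , drop = FALSE]
  # Keep only the latest known version of each time_value.
  known <- known[order(known$time_value, known$version), , drop = FALSE]
  latest <- known[!duplicated(known$time_value, fromLast = TRUE), , drop = FALSE]
  if (is.finite(before)) {
    latest <- latest[latest$time_value >= ref_time - before, , drop = FALSE]
  }
  latest[, c("time_value", "value")]
}

# "Slide" a mean over three reference times, each seeing only its own snapshot.
ref_times <- as.Date(c("2020-06-02", "2020-06-03", "2020-06-04"))
slide_means <- vapply(
  seq_along(ref_times),
  function(i) mean(snapshot_as_of(updates, ref_times[i], before = 2)$value),
  numeric(1)
)
slide_means
#> [1] 10 15 25
```

Note that the revision of the 2020-06-01 value (from 10 to 12) only becomes visible at reference time 2020-06-04, by which point that time_value has already left the 2-day window.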
Usage
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
# S3 method for class 'epi_archive'
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
# S3 method for class 'grouped_epi_archive'
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
Arguments
- .x
An epi_archive or grouped_epi_archive object. If ungrouped, all data in .x will be treated as part of a single data group.
- .f
Function, formula, or missing; together with ... specifies the computation to slide. To "slide" means to apply a computation over a sliding (a.k.a. "rolling") time window for each data group. The window is determined by the .before parameter (see details for more). If a function, .f must have the form function(x, g, t, ...), where
  - "x" is an epi_df with the same column names as the archive's DT, minus the version column
  - "g" is a one-row tibble containing the values of the grouping variables for the associated group
  - "t" is the ref_time_value for the current window
  - "..." are additional arguments
If a formula, .f can operate directly on columns accessed via .x$var or .$var, as in ~ mean(.x$var) to compute a mean of a column var for each group-ref_time_value combination. The group key can be accessed via .y or .group_key, and the reference time value can be accessed via .z or .ref_time_value. If .f is missing, then ... will specify the computation.
- ...
Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then the ... is interpreted as a "data-masking" expression or expressions for tidy evaluation; in addition to referring to columns directly by name, the expressions have access to the .data and .env pronouns as in dplyr verbs, and can also refer to .x (not the same as the input epi_archive), .group_key, and .ref_time_value. See details for more.
- .before
How many time values before the .ref_time_value should each snapshot handed to the function .f contain? If provided, it should be a single value that is compatible with the time_type of the time_value column (more below), but most commonly an integer. This window endpoint is inclusive. For example, if .before = 7, time_type in the archive is "day", and the .ref_time_value is January 8, then the smallest time_value in the snapshot will be January 1. If missing, then the default is no limit on the time values, so the full snapshot is given.
- .versions
Reference time values / versions for sliding computations; each element of this vector serves both as the anchor point for the time_value window for the computation and as the max_version for the epix_as_of call with which we fetch data in this window. If missing, then this will be set to a regularly-spaced sequence of values covering the range of versions in the DT plus the versions_end; the spacing of values will be guessed (using the GCD of the skips between values).
- .new_col_name
Either NULL or a string indicating the name of the new column that will contain the derived values. The default, NULL, will use the name "slide_value" unless your slide computations output data frames, in which case they will be unpacked into the constituent columns and those names used. If the resulting column name(s) overlap with the column names used for labeling the computations, which are group_vars(x) and "version", then the values for these columns must be identical to the labels we assign.
- .all_versions
(Not the same as the .all_rows parameter of epi_slide.) If .all_versions = TRUE, then the slide computation will be passed the version history (all version <= .version where .version is one of the requested .versions) for rows having a time_value of at least .version - .before. Otherwise, the slide computation will be passed only the most recent version for every unique time_value. Default is FALSE.
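The three ways of specifying the computation described under .f and ... can be placed side by side. This is a sketch rather than an authoritative recipe: it assumes the epiprocess package is attached and that the archive_cases_dv_subset example archive used in the Examples section is available in the session; the output column name avg is made up for illustration.

```r
library(dplyr)
library(epiprocess)

# Assumption: `archive_cases_dv_subset` is available, as in the Examples section.
versions <- seq(as.Date("2020-06-02"), as.Date("2020-06-15"), by = "1 day")
grouped_archive <- archive_cases_dv_subset %>% group_by(geo_value)

# 1. A function of the snapshot `x`, the group key `g`, and the reference time `t`.
by_function <- grouped_archive %>%
  epix_slide(
    function(x, g, t) mean(x$case_rate_7d_av),
    .before = 2, .versions = versions, .new_col_name = "avg"
  )

# 2. A formula, where `.x` refers to the snapshot.
by_formula <- grouped_archive %>%
  epix_slide(
    ~ mean(.x$case_rate_7d_av),
    .before = 2, .versions = versions, .new_col_name = "avg"
  )

# 3. `.f` missing: a data-masking expression passed through `...`,
#    with the argument name supplying the output column name.
by_expression <- grouped_archive %>%
  epix_slide(
    avg = mean(case_rate_7d_av),
    .before = 2, .versions = versions
  )
```

Each call should return a tibble labeled by geo_value and version, with the slide values in the avg column.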
Value
A tibble whose columns are: the grouping variables; version, containing
the reference time values for the slide computation; and a column named
according to the .new_col_name argument, containing the slide values.
Details
A few key distinctions between the current function and epi_slide():
- In .f functions for epix_slide, one should not assume that the input data will contain any rows with time_value matching the computation's .ref_time_value (accessible via attributes(<data>)$metadata$as_of); for typical epidemiological surveillance data, observations pertaining to a particular time period (time_value) are first reported as_of some instant after that time period has ended.
- The input class and columns are similar but different: epix_slide (with the default .all_versions=FALSE) keeps all columns and the epi_df-ness of the first argument to each computation; epi_slide only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the essential geo_value column. (With .all_versions=TRUE, epix_slide will provide an epi_archive rather than an epi_df to each computation.)
- The output class and columns are similar but different: epix_slide() returns a tibble containing only the grouping variables, version, and the new column(s) from the slide computations, whereas epi_slide() returns an epi_df with all original variables plus the new columns from the slide computations. (Both will mirror the grouping or ungroupedness of their input, with one exception: epi_archives can have trivial (zero-variable) groupings, but these will be dropped in epix_slide results as they are not supported by tibbles.)
- There are no size stability checks or element/row recycling to maintain size stability in epix_slide, unlike in epi_slide. (epix_slide is roughly analogous to dplyr::group_modify, while epi_slide is roughly analogous to dplyr::mutate followed by dplyr::arrange.) This is detailed in the "advanced" vignette.
- .all_rows is not supported in epix_slide; since the slide computations are allowed more flexibility in their outputs than in epi_slide, we can't guess a good representation for missing computations for excluded group-.ref_time_value pairs.
- The .versions default for epix_slide is based on making an evenly-spaced sequence out of the versions in the DT plus the versions_end, rather than the time_values.
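The "evenly-spaced sequence" with GCD-guessed spacing mentioned for the .versions default can be illustrated in base R. This is a rough sketch of the idea only, not the package's actual implementation; the gcd2() helper and the version dates are hypothetical.

```r
# Hypothetical helper: greatest common divisor of two non-negative integers
# (Euclid's algorithm).
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Made-up version dates observed in an archive, plus its versions_end.
observed_versions <- as.Date(c("2020-06-02", "2020-06-04", "2020-06-08"))
versions_end <- as.Date("2020-06-12")

all_versions <- sort(unique(c(observed_versions, versions_end)))
skips <- as.integer(diff(all_versions)) # gaps in days: 2, 4, 4
step <- Reduce(gcd2, skips)             # common spacing: 2 days

# Evenly spaced reference versions covering the whole range.
default_like_versions <- seq(min(all_versions), max(all_versions), by = step)
default_like_versions
#> [1] "2020-06-02" "2020-06-04" "2020-06-06" "2020-06-08" "2020-06-10"
#> [6] "2020-06-12"
```

Under this scheme the sequence includes dates such as 2020-06-06 on which no update was recorded, so passing .versions explicitly, as the examples do, is the way to control exactly which reference times are computed.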
Apart from the above distinctions, the interfaces of epix_slide() and
epi_slide() are the same.
Furthermore, the current function can be considerably slower than
epi_slide(), for two reasons: (1) it must repeatedly fetch
properly-versioned snapshots from the data archive (via epix_as_of()),
and (2) it performs a "manual" sliding of sorts, and does not benefit from
the highly efficient slider package. For this reason, it should not be
used in place of epi_slide() unless version-aware sliding is necessary
(as is its purpose).
Examples
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# Reference time points for which we want to compute slide values:
versions <- seq(as.Date("2020-06-02"),
  as.Date("2020-06-15"),
  by = "1 day"
)
# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    .f = ~ mean(.x$case_rate_7d_av),
    .before = 2,
    .versions = versions,
    .new_col_name = "case_rate_7d_av_recent_av"
  ) %>%
  ungroup()
#> # A tibble: 56 × 3
#> geo_value version case_rate_7d_av_recent_av
#> <chr> <date> <dbl>
#> 1 ca 2020-06-02 6.63
#> 2 fl 2020-06-02 3.38
#> 3 ny 2020-06-02 6.57
#> 4 tx 2020-06-02 4.52
#> 5 ca 2020-06-03 6.54
#> 6 fl 2020-06-03 3.42
#> 7 ny 2020-06-03 6.66
#> 8 tx 2020-06-03 4.75
#> 9 ca 2020-06-04 6.53
#> 10 fl 2020-06-04 3.77
#> # ℹ 46 more rows
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example). In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#   discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == .ref_time_value` as of `.ref_time_value`.
# The example below shows this type of behavior in more detail.
# Examining characteristics of the data passed to each computation with
# `.all_versions=FALSE`.
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = FALSE,
    .versions = versions
  ) %>%
  ungroup() %>%
  arrange(geo_value, version)
#> # A tibble: 56 × 5
#> geo_value version time_range n class1
#> <chr> <date> <chr> <int> <chr>
#> 1 ca 2020-06-02 2020-06-01 -- 2020-06-01 1 epi_df
#> 2 ca 2020-06-03 2020-06-01 -- 2020-06-02 2 epi_df
#> 3 ca 2020-06-04 2020-06-01 -- 2020-06-03 3 epi_df
#> 4 ca 2020-06-05 2020-06-01 -- 2020-06-04 4 epi_df
#> 5 ca 2020-06-06 2020-06-01 -- 2020-06-05 5 epi_df
#> 6 ca 2020-06-07 2020-06-02 -- 2020-06-06 5 epi_df
#> 7 ca 2020-06-08 2020-06-03 -- 2020-06-07 5 epi_df
#> 8 ca 2020-06-09 2020-06-04 -- 2020-06-08 5 epi_df
#> 9 ca 2020-06-10 2020-06-05 -- 2020-06-09 5 epi_df
#> 10 ca 2020-06-11 2020-06-06 -- 2020-06-10 5 epi_df
#> # ℹ 46 more rows
# --- Advanced: ---
# `epix_slide` with `.all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `.all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = TRUE,
    .versions = versions
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)
#> # A tibble: 14 × 6
#> version versions_start versions_end time_range n class1
#> <date> <chr> <date> <chr> <int> <chr>
#> 1 2020-06-02 2020-06-02 2020-06-02 2020-06-01 -- 2020-06-01 1 epi_ar…
#> 2 2020-06-03 2020-06-02 2020-06-03 2020-06-01 -- 2020-06-02 2 epi_ar…
#> 3 2020-06-04 2020-06-02 2020-06-04 2020-06-01 -- 2020-06-03 3 epi_ar…
#> 4 2020-06-05 2020-06-02 2020-06-05 2020-06-01 -- 2020-06-04 4 epi_ar…
#> 5 2020-06-06 2020-06-02 2020-06-06 2020-06-01 -- 2020-06-05 8 epi_ar…
#> 6 2020-06-07 2020-06-03 2020-06-07 2020-06-02 -- 2020-06-06 9 epi_ar…
#> 7 2020-06-08 2020-06-04 2020-06-08 2020-06-03 -- 2020-06-07 9 epi_ar…
#> 8 2020-06-09 2020-06-05 2020-06-09 2020-06-04 -- 2020-06-08 8 epi_ar…
#> 9 2020-06-10 2020-06-06 2020-06-10 2020-06-05 -- 2020-06-09 8 epi_ar…
#> 10 2020-06-11 2020-06-07 2020-06-11 2020-06-06 -- 2020-06-10 8 epi_ar…
#> 11 2020-06-12 2020-06-08 2020-06-12 2020-06-07 -- 2020-06-11 8 epi_ar…
#> 12 2020-06-13 2020-06-09 2020-06-13 2020-06-08 -- 2020-06-12 8 epi_ar…
#> 13 2020-06-14 2020-06-10 2020-06-14 2020-06-09 -- 2020-06-13 8 epi_ar…
#> 14 2020-06-15 2020-06-11 2020-06-15 2020-06-10 -- 2020-06-14 8 epi_ar…