One of the two main data structures for storing time series in epiprocess.
It is simply a tibble with at least two columns, geo_value and time_value,
that provide the keys for the time series. It can have any other columns,
which can be seen as measured variables at each key. In brief, an epi_df
represents a snapshot of an epidemiological data set at a point in time.
Usage
as_epi_df(x, ...)
# S3 method for class 'epi_df'
as_epi_df(x, ...)
# S3 method for class 'tbl_df'
as_epi_df(
  x,
  geo_type = deprecated(),
  time_type = deprecated(),
  as_of,
  other_keys = character(),
  ...
)
# S3 method for class 'data.frame'
as_epi_df(x, as_of, other_keys = character(), ...)
# S3 method for class 'tbl_ts'
as_epi_df(x, as_of, other_keys = character(), ...)
new_epi_df(
  x = tibble::tibble(geo_value = character(), time_value = as.Date(integer())),
  geo_type,
  time_type,
  as_of,
  other_keys = character(),
  ...
)
is_epi_df(x)
Arguments
- x
An object.
- ...
Additional arguments passed to methods.
- geo_type
In as_epi_df(), has no effect; the geo value type is inferred from the location column and set to "custom" if not recognized. In new_epi_df(), should be set to the same value that would be inferred.
- time_type
In as_epi_df(), has no effect; the time value type is inferred from the time column and set to "custom" if not recognized. Unpredictable behavior may result if the time type is not recognized. In new_epi_df(), should be set to the same value that would be inferred.
- as_of
Time value representing the time at which the given data were available. For example, if as_of is January 31, 2022, then the epi_df object that is created would represent the most up-to-date version of the data available as of January 31, 2022. If the as_of argument is missing, then the current day-time will be used.
- other_keys
If your tibble has additional keys, be sure to specify them as a character vector here (typical examples are "age" or sub-geographies).
Details
An epi_df is a tibble with (at least) the following columns:
- geo_value: A character vector representing the geographical unit of observation. This could be a country code, a state name, a county code, etc.
- time_value: A date or integer vector representing the time of observation.
Other columns can be considered as measured variables, which we also refer to
as signal variables. An epi_df object also has metadata with (at least)
the following fields:
- geo_type: the type for the geo values.
- as_of: the time value at which the given data were available.
Most users should use as_epi_df. The input tibble x to the constructor
must contain the columns geo_value and time_value. All other columns
will be preserved as is, and treated as measured variables. If as_of is
missing, then the function will try to guess it from an as_of, issue,
or version column of x (if any of these are present), or from an
as_of field in its metadata (stored in its attributes); if this fails,
then the current day-time will be used. The new_epi_df constructor
assumes its arguments have already been validated, so it should mainly be
used by advanced users.
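The fallback order described above can be sketched in base R as follows. This is an illustrative sketch only, not the package's implementation: guess_as_of is a hypothetical helper, the order of checks simply follows the prose, and taking the maximum of a matching column is an assumption made here for illustration.

```r
# Hypothetical helper (not part of epiprocess): mimics the fallback order
# described in the text for a missing `as_of`.
guess_as_of <- function(x) {
  # 1. Look for an `as_of`, `issue`, or `version` column, in that order.
  for (col in c("as_of", "issue", "version")) {
    if (col %in% names(x)) {
      # Assumption: use the latest value found in that column.
      return(max(x[[col]]))
    }
  }
  # 2. Otherwise look for an `as_of` field in the `metadata` attribute.
  meta <- attr(x, "metadata", exact = TRUE)
  if (is.list(meta) && !is.null(meta[["as_of"]])) {
    return(meta[["as_of"]])
  }
  # 3. Otherwise fall back to the current date-time.
  Sys.time()
}

# Toy inputs (made-up values).
with_issue <- data.frame(
  geo_value = c("ca", "ca"),
  time_value = as.Date(c("2022-01-29", "2022-01-30")),
  issue = as.Date(c("2022-01-30", "2022-01-31"))
)
with_meta <- data.frame(geo_value = "ca", time_value = as.Date("2022-01-30"))
attr(with_meta, "metadata") <- list(as_of = as.Date("2022-02-01"))
plain <- data.frame(geo_value = "ca", time_value = as.Date("2022-01-30"))

from_column <- guess_as_of(with_issue)
from_metadata <- guess_as_of(with_meta)
from_clock <- guess_as_of(plain)
```

In practice, passing as_of explicitly avoids relying on any of these guesses.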
Metadata for an epi_df object x can be accessed (and altered) via
attributes(x)$metadata. The first field in the above list, geo_type,
can usually be inferred from the geo_value column. It is not
currently used by any downstream functions in the epiprocess package,
and serves only as a useful bit of information to convey about the data set
at hand. More information on its coding is given below.
The last field in the above list, as_of, is one of the most distinctive
aspects of an epi_df object. In brief, we can think of an epi_df object as a
single snapshot of a data set that contains the most up-to-date values of
the signal variables, as of the time specified in the as_of field.
If an epi_df ever loses its geo_value or time_value columns, it will
decay into a regular tibble.
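A minimal sketch of this behavior, assuming the epiprocess and dplyr packages are installed. The toy data values are made up for illustration; is_epi_df() is used to check the class before and after a key column is dropped.

```r
library(epiprocess)

# Toy data (made-up values) with the two required key columns.
x <- as_epi_df(
  data.frame(
    geo_value = c("ca", "fl"),
    time_value = as.Date(c("2022-01-30", "2022-01-30")),
    cases = c(10, 5)
  ),
  as_of = as.Date("2022-01-31")
)

# Check the class before dropping a key column...
before <- is_epi_df(x)

# ...and after dropping `geo_value`.
dropped <- dplyr::select(x, -geo_value)
after <- is_epi_df(dropped)
```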
A companion object is the epi_archive object, which contains the full
version history of a given data set. Revisions are common in many types of
epidemiological data streams, and paying attention to data revisions can be
important for all sorts of downstream data analysis and modeling tasks. See
the documentation for epi_archive for more details on how
data versioning works in the epiprocess package (including how to
generate epi_df objects, as data snapshots, from an epi_archive object).
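To make the relationship between a version history and a snapshot concrete, here is a conceptual sketch in base R. It does not use or reproduce the epi_archive machinery: the history data frame, its version column, and the snapshot_as_of helper are all hypothetical and exist only for illustration.

```r
# Hypothetical version history: one row per (geo_value, time_value, version).
# The value for ca on 2022-01-30 was revised from 10 to 12 on 2022-02-02.
history <- data.frame(
  geo_value  = c("ca", "ca", "ca", "fl"),
  time_value = as.Date(c("2022-01-30", "2022-01-30", "2022-01-31", "2022-01-30")),
  version    = as.Date(c("2022-01-31", "2022-02-02", "2022-02-01", "2022-01-31")),
  cases      = c(10, 12, 7, 5)
)

# Hypothetical helper: keep, for each (geo_value, time_value) key, the most
# recent row whose version is no later than `as_of`.
snapshot_as_of <- function(history, as_of) {
  visible <- history[history$version <= as_of, ]
  # Sort so that the newest version of each key comes first.
  ord <- order(visible$geo_value, visible$time_value, -as.numeric(visible$version))
  visible <- visible[ord, ]
  latest <- visible[!duplicated(visible[c("geo_value", "time_value")]), ]
  latest$version <- NULL
  rownames(latest) <- NULL
  latest
}

snap_early <- snapshot_as_of(history, as.Date("2022-01-31"))
snap_late <- snapshot_as_of(history, as.Date("2022-02-02"))
```

The two snapshots differ both in which rows are present and in the value reported for ca on 2022-01-30, which is why analyses that ignore as_of can silently mix revised and unrevised data.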
Geo Types
The following geo types are recognized in an epi_df.
- "county": each observation corresponds to a U.S. county; coded by 5-digit FIPS code.
- "hrr": each observation corresponds to a U.S. hospital referral region (designed to represent regional healthcare markets); there are 306 HRRs in the U.S.; coded by number (nonconsecutive, between 1 and 457).
- "state": each observation corresponds to a U.S. state; coded by 2-digit postal abbreviation (lowercase); note that Puerto Rico is "pr" and Washington D.C. is "dc".
- "hhs": each observation corresponds to a U.S. HHS region; coded by number (consecutive, between 1 and 10).
- "nation": each observation corresponds to a country; coded by ISO 3166-1 alpha-2 country codes (lowercase).
An unrecognizable geo type is labeled "custom".
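The coding conventions above can be checked before building an epi_df. The following base R sketch is one way to do that; check_geo_coding is a hypothetical helper written for this illustration and is not the inference logic used by epiprocess. It only tests the format of the codes, not whether a code refers to a real county, state, or country.

```r
# Hypothetical helper (not part of epiprocess): returns TRUE/FALSE if every
# value matches the coding convention for `geo_type`, or NA for other types.
check_geo_coding <- function(geo_value, geo_type) {
  geo_value <- as.character(geo_value)
  in_range <- function(lo, hi) {
    # Digits only, then a numeric range check (short-circuits on non-digits).
    all(grepl("^[0-9]+$", geo_value)) &&
      all(as.integer(geo_value) >= lo & as.integer(geo_value) <= hi)
  }
  switch(geo_type,
    county = all(grepl("^[0-9]{5}$", geo_value)), # 5-digit FIPS, as character
    hrr    = in_range(1, 457),                    # nonconsecutive numbers
    state  = all(grepl("^[a-z]{2}$", geo_value)), # lowercase postal code
    hhs    = in_range(1, 10),                     # regions 1 through 10
    nation = all(grepl("^[a-z]{2}$", geo_value)), # lowercase alpha-2 code
    NA                                            # e.g. "custom"
  )
}

county_ok  <- check_geo_coding(c("06059", "42101"), "county")
county_bad <- check_geo_coding(c("6059", "42101"), "county") # lost leading zero
state_ok   <- check_geo_coding(c("ca", "pr", "dc"), "state")
state_bad  <- check_geo_coding(c("CA", "FL"), "state")       # uppercase
hhs_bad    <- check_geo_coding(c(1, 11), "hhs")
custom_na  <- check_geo_coding(c("zone-a"), "custom")
```

The county_bad case shows why county codes should be stored as character strings: reading them as numbers drops leading zeroes.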
Time Types
The following time types are recognized in an epi_df.
- "day": each observation corresponds to a day; coded as a Date object, as in as.Date("2022-01-31").
- "week": each observation corresponds to a week; the alignment can be arbitrary (as to whether a week starts on a Monday, Tuesday, etc.); coded as a Date object, representing the start date of the week.
- "yearmonth": each observation corresponds to a month; coded as a tsibble::yearmonth object.
- "integer": a generic integer index (e.g. years or something else).
An unrecognizable time type is labeled "custom".
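As a small illustration of these codings, the base R sketch below builds time_value vectors for the "day", "week", and "integer" cases. The dates are arbitrary examples. The "yearmonth" case is omitted only because it needs the tsibble package.

```r
# Daily data: one Date per day.
daily <- seq(as.Date("2022-01-31"), by = "day", length.out = 3)

# Weekly data: Dates marking the start of each week. The alignment is
# arbitrary; here every week starts on the same weekday as 2022-01-31.
weekly <- seq(as.Date("2022-01-31"), by = "week", length.out = 3)

# Generic integer index, e.g. years.
yearly <- 2020:2022

# Spacing between consecutive weekly values, in days.
week_gaps <- as.numeric(diff(weekly))

# ISO weekday number (1 = Monday, ..., 7 = Sunday) of each week start.
week_start_day <- unique(format(weekly, "%u"))
```

Keeping week starts on a single weekday, as checked by week_start_day, is what makes the weekly alignment consistent across the series.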
Functions
- as_epi_df(): The preferred way of constructing epi_dfs.
- new_epi_df(): Lower-level constructor for an epi_df object.
Examples
# Convert a `tsibble` that has state name as an extra key
# Notice that the county code in `geo_value` should be a character string to
# preserve any leading zeroes
ex1_input <- tibble::tibble(
  geo_value = c(
    "06059", "06061", "06067",
    "12111", "12113", "12117",
    "42101", "42103", "42105"
  ),
  state_name = rep(c("ca", "fl", "pa"), each = 3),
  time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"),
    by = "day"
  ), length.out = length(geo_value)),
  value = 1:length(geo_value) + 0.01 * rnorm(length(geo_value))
) %>%
  tsibble::as_tsibble(index = time_value, key = c(geo_value, state_name))
# The `other_keys` metadata (`"state_name"` in this case) is automatically
# inferred from the `tsibble`'s `key`:
ex1 <- as_epi_df(x = ex1_input, as_of = "2020-06-03")
attr(ex1, "metadata")[["other_keys"]]
#> [1] "state_name"
# Dealing with misspecified column names:
# Geographical and temporal information must be provided in columns named
# `geo_value` and `time_value`; if we start from a data frame with a
# different format, it must be converted to use `geo_value` and `time_value`
# before calling `as_epi_df`.
ex2_input <- tibble::tibble(
  state = rep(c("ca", "fl", "pa"), each = 3), # misnamed
  pol = rep(c("blue", "swing", "swing"), each = 3), # extra key
  reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"),
    by = "day"
  ), length.out = length(state)), # misnamed
  value = 1:length(state) + 0.01 * rnorm(length(state))
)
print(ex2_input)
#> # A tibble: 9 × 4
#>   state pol   reported_date value
#>   <chr> <chr> <date>        <dbl>
#> 1 ca    blue  2020-06-01    0.991
#> 2 ca    blue  2020-06-02    2.00
#> 3 ca    blue  2020-06-03    3.00
#> 4 fl    swing 2020-06-01    3.99
#> 5 fl    swing 2020-06-02    5.01
#> 6 fl    swing 2020-06-03    6.02
#> 7 pa    swing 2020-06-01    7.00
#> 8 pa    swing 2020-06-02    7.99
#> 9 pa    swing 2020-06-03    9.00
ex2 <- ex2_input %>%
  dplyr::rename(geo_value = state, time_value = reported_date) %>%
  as_epi_df(
    as_of = "2020-06-03",
    other_keys = "pol"
  )
attr(ex2, "metadata")
#> $geo_type
#> [1] "state"
#>
#> $time_type
#> [1] "day"
#>
#> $as_of
#> [1] "2020-06-03"
#>
#> $other_keys
#> [1] "pol"
#>
# Adding additional keys to an `epi_df` object
ex3_input <- covid_incidence_county_subset %>%
  dplyr::filter(time_value > "2021-12-01", state_name == "Massachusetts") %>%
  dplyr::slice_tail(n = 6)
ex3 <- ex3_input %>%
  tsibble::as_tsibble() %>% # needed to add the additional metadata
  # add 2 extra keys
  dplyr::mutate(
    state = rep("MA", 6),
    pol = rep(c("blue", "swing", "swing"), each = 2)
  ) %>%
  as_epi_df(other_keys = c("state", "pol"))
attr(ex3, "metadata")
#> $geo_type
#> [1] "county"
#>
#> $time_type
#> [1] "day"
#>
#> $as_of
#> [1] "2024-10-22 20:05:58 UTC"
#>
#> $other_keys
#> [1] "state" "pol"
#>
# Decays to a tibble
covid_incidence_county_subset %>%
  dplyr::select(-geo_value)
#> # A tibble: 16,212 × 4
#>    time_value cases county_name       state_name
#>  * <date>     <dbl> <chr>             <chr>
#>  1 2020-06-01     4 Barnstable County Massachusetts
#>  2 2020-06-01     0 Berkshire County  Massachusetts
#>  3 2020-06-01    78 Bristol County    Massachusetts
#>  4 2020-06-01     0 Dukes County      Massachusetts
#>  5 2020-06-01    92 Essex County      Massachusetts
#>  6 2020-06-01     0 Franklin County   Massachusetts
#>  7 2020-06-01    35 Hampden County    Massachusetts
#>  8 2020-06-01     4 Hampshire County  Massachusetts
#>  9 2020-06-01    98 Middlesex County  Massachusetts
#> 10 2020-06-01     0 Nantucket County  Massachusetts
#> # ℹ 16,202 more rows