An epi_archive is an S3 class that contains a data table along with several relevant pieces of metadata. The data table can be seen as the full archive (version history) for some signal variables of interest.
Usage
new_epi_archive(
  x,
  geo_type,
  time_type,
  other_keys,
  compactify,
  clobberable_versions_start,
  versions_end,
  compactify_tol = .Machine$double.eps^0.5
)
validate_epi_archive(
  x,
  other_keys,
  compactify,
  clobberable_versions_start,
  versions_end
)
as_epi_archive(
  x,
  geo_type = deprecated(),
  time_type = deprecated(),
  other_keys = character(),
  compactify = NULL,
  clobberable_versions_start = NA,
  .versions_end = max_version_with_row_in(x),
  ...,
  versions_end = .versions_end
)
Arguments
- x
A data.frame, data.table, or tibble, with columns geo_value, time_value, version, and then any additional number of columns.
- geo_type
DEPRECATED Has no effect. Geo value type is inferred from the location column and set to "custom" if not recognized.
- time_type
DEPRECATED Has no effect. Time value type is inferred from the time column and set to "custom" if not recognized. Unpredictable behavior may result if the time type is not recognized.
- other_keys
Character vector specifying the names of variables in x that should be considered key variables (in the language of data.table) apart from "geo_value", "time_value", and "version". Typical examples are "age" or more granular geographies.
- compactify
Optional; Boolean. TRUE will remove some redundant rows, FALSE will not, and missing or NULL will remove redundant rows, but issue a warning. See more information at compactify.
- clobberable_versions_start
Optional; length-1; either a value of the same class and typeof as x$version, or an NA of any class and typeof: specifically, either (a) the earliest version that could be subject to "clobbering" (being overwritten with different update data, but using the same version tag as the old update data), or (b) NA, to indicate that no versions are clobberable. There are a variety of reasons why versions could be clobberable under routine circumstances, such as (a) today's version of one/all of the columns being published after initially being filled with NA or LOCF, (b) a buggy version of today's data being published but then fixed and republished later in the day, or (c) data pipeline delays (e.g., publisher uploading, periodic scraping, database syncing, periodic fetching, etc.) that make events (a) or (b) reflected later in the day (or even on a different day) than expected; potential causes vary between different data pipelines. The default value is NA, which doesn't consider any versions to be clobberable. Another setting that may be appropriate for some pipelines is max_version_with_row_in(x).
- versions_end
Optional; length-1, same class and typeof as x$version: what is the last version we have observed? The default is max_version_with_row_in(x), but values greater than this could also be valid, and would indicate that we observed additional versions of the data beyond max(x$version), but they all contained empty updates. (The default value of clobberable_versions_start does not fully trust these empty updates, and assumes that any version >= max(x$version) could be clobbered.) If nrow(x) == 0, then this argument is mandatory.
- compactify_tol
Double. The tolerance used to detect approximate equality for compactification.
- .versions_end
Position-based versions_end. It prevents a renaming argument whose name is a prefix of versions_end, such as version = issue, from being assigned to versions_end instead of being used to rename columns.
- ...
Used for specifying column names, as in dplyr::rename. For example, version = release_date.
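To make the compactify and compactify_tol arguments more concrete, the following base R sketch drops rows that add no information. It assumes that "redundant" means a row whose value matches, within the tolerance, the row for the previous version of the same geo_value and time_value (that is, a last-observation-carried-forward repeat). The helper compactify_sketch and the toy data are hypothetical and for illustration only; this is not the package's implementation, and it ignores NA handling and multiple value columns.

```r
# Toy update data: one location, one time_value, four successive versions.
updates <- data.frame(
  geo_value  = "ca",
  time_value = as.Date("2020-01-01"),
  version    = as.Date(c("2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05")),
  value      = c(10, 10, 12, 12 + 1e-12)
)

# Hypothetical helper (not part of epiprocess): drop rows whose value is
# approximately equal to the previous version's value for the same key.
compactify_sketch <- function(updates, tol = .Machine$double.eps^0.5) {
  updates <- updates[order(updates$geo_value, updates$time_value, updates$version), ,
    drop = FALSE
  ]
  n <- nrow(updates)
  if (n < 2L) {
    return(updates)
  }
  prev <- seq_len(n - 1L)
  curr <- prev + 1L
  # A row can only be redundant relative to a row with the same non-version key.
  same_key <- updates$geo_value[curr] == updates$geo_value[prev] &
    updates$time_value[curr] == updates$time_value[prev]
  same_value <- abs(updates$value[curr] - updates$value[prev]) <= tol
  # The first row is never redundant.
  redundant <- c(FALSE, same_key & same_value)
  out <- updates[!redundant, , drop = FALSE]
  rownames(out) <- NULL
  out
}

compact <- compactify_sketch(updates)
compact
```

In this toy data, the versions dated 2020-01-03 and 2020-01-05 repeat the preceding value (the latter up to a difference far below the tolerance), so only the versions dated 2020-01-02 and 2020-01-04 remain.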
Details
Epi Archive
An epi_archive contains a data table DT, of class data.table from the data.table package, with (at least) the following columns:
- geo_value: the geographic value associated with each row of measurements.
- time_value: the time value associated with each row of measurements.
- version: the time value specifying the version for each row of measurements. For example, if in a given row the version is January 15, 2022 and time_value is January 14, 2022, then this row contains the measurements of the data for January 14, 2022 that were available one day later.
The data table DT has key variables geo_value, time_value, version, as well as any others (these can be specified when instantiating the epi_archive object via the other_keys argument, and/or set by operating on DT directly). Note that there can only be a single row per unique combination of key variables.
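The single-row-per-key requirement can be checked on raw update data before building an archive. The base R sketch below uses hypothetical toy data and variable names (updates, key_cols, first_dup); it only illustrates the constraint and does not show how the package validates its input.

```r
# Toy update data in which rows 2 and 3 share the same key
# (geo_value, time_value, version) but report different values.
updates <- data.frame(
  geo_value  = c("ca", "ca", "ca"),
  time_value = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01")),
  version    = as.Date(c("2020-01-02", "2020-01-03", "2020-01-03")),
  value      = c(1, 2, 3)
)
key_cols <- c("geo_value", "time_value", "version")

# anyDuplicated() returns the index of the first row that repeats an earlier
# key combination, or 0 if every key combination is unique.
first_dup <- anyDuplicated(updates[key_cols])
has_unique_keys <- first_dup == 0L
```

Here first_dup points at the third row, so the data would need to be deduplicated, or given an additional key column via other_keys, before it satisfies the constraint.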
Metadata
The following pieces of metadata are included as fields in an epi_archive object:
- geo_type: the type for the geo values.
- time_type: the type for the time values.
- other_keys: any additional keys as a character vector. Typical examples are "age" or sub-geographies.
While this metadata is not protected, it is generally recommended to treat it as read-only, and to use the epi_archive methods to interact with the data archive. Unexpected behavior may result from modifying the metadata directly.
Generating Snapshots
An epi_archive object can be used to generate a snapshot of the data in epi_df format, which represents the most up-to-date time series values up to a point in time. This is accomplished by calling epix_as_of().
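As a conceptual illustration of what a snapshot contains, the base R sketch below keeps, for each geo_value and time_value, the row with the latest version that is no later than the requested date. The helper snapshot_as_of and the toy data are hypothetical. This is a sketch of the idea under those assumptions, not the implementation of epix_as_of(), and it returns a plain data frame rather than an epi_df.

```r
# Toy update data: each time_value is reported once and later revised.
updates <- data.frame(
  geo_value  = c("ca", "ca", "ca", "ca"),
  time_value = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  version    = as.Date(c("2020-01-02", "2020-01-04", "2020-01-03", "2020-01-05")),
  value      = c(10, 12, 20, 25)
)

# Hypothetical helper (not part of epiprocess).
snapshot_as_of <- function(updates, as_of) {
  # Keep only updates that had been published by `as_of`.
  seen <- updates[updates$version <= as_of, , drop = FALSE]
  # Sort so that the latest version of each key comes first.
  seen <- seen[order(seen$geo_value, seen$time_value, -as.numeric(seen$version)), ,
    drop = FALSE
  ]
  # Keep the first (latest) row per (geo_value, time_value).
  latest <- seen[!duplicated(seen[c("geo_value", "time_value")]), , drop = FALSE]
  rownames(latest) <- NULL
  latest[c("geo_value", "time_value", "value")]
}

snap <- snapshot_as_of(updates, as.Date("2020-01-04"))
snap
```

As of 2020-01-04, the revision to 2020-01-01 (value 12) has been published, but the revision to 2020-01-02 (value 25) has not, so the snapshot still shows the original value 20 for that day.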
Sliding Computations
We can run a sliding computation over an epi_archive object, much like epi_slide() does for an epi_df object. This is accomplished by calling epix_slide(), which works similarly to the way epi_slide() works for an epi_df object, but with one key difference: it is version-aware. That is, for an epi_archive object, the sliding computation at any given reference time point t is performed on data that would have been available as of t.
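The version-aware behavior can be sketched in base R: for each reference time t, first reconstruct the data as it stood at t, then compute on a trailing window. The helper version_aware_mean, its before argument, and the toy data are hypothetical. This is an illustration of the idea under those assumptions, not the package's sliding implementation or its interface.

```r
# Toy update data: each time_value is reported once and later revised.
updates <- data.frame(
  geo_value  = c("ca", "ca", "ca", "ca"),
  time_value = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  version    = as.Date(c("2020-01-02", "2020-01-04", "2020-01-03", "2020-01-05")),
  value      = c(10, 12, 20, 25)
)

# Hypothetical helper (not part of epiprocess): mean of the values in the
# window [t - before, t], using only the data available as of each t.
version_aware_mean <- function(updates, ref_times, before = 7) {
  vapply(seq_along(ref_times), function(i) {
    t <- ref_times[i]
    # Data that had been published by t.
    seen <- updates[updates$version <= t, , drop = FALSE]
    # Latest available version per (geo_value, time_value).
    seen <- seen[order(seen$geo_value, seen$time_value, -as.numeric(seen$version)), ,
      drop = FALSE
    ]
    snap <- seen[!duplicated(seen[c("geo_value", "time_value")]), , drop = FALSE]
    # Trailing window ending at the reference time.
    in_window <- snap$time_value >= t - before & snap$time_value <= t
    mean(snap$value[in_window])
  }, numeric(1))
}

ref_times <- as.Date(c("2020-01-03", "2020-01-05"))
result <- version_aware_mean(updates, ref_times)
result
```

With this toy data, the result at 2020-01-03 is 15, computed from the originally reported values 10 and 20. The result at 2020-01-05 is 18.5, computed from the revised values 12 and 25. A computation that ignored versions would use the revised values for both reference times.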
Examples
# Simple ex. with necessary keys
tib <- tibble::tibble(
  geo_value = rep(c("ca", "hi"), each = 5),
  time_value = rep(seq(as.Date("2020-01-01"),
    by = 1, length.out = 5
  ), times = 2),
  version = rep(seq(as.Date("2020-01-02"),
    by = 1, length.out = 5
  ), times = 2),
  value = rnorm(10, mean = 2, sd = 1)
)
toy_epi_archive <- tib %>% as_epi_archive()
toy_epi_archive
#> → An `epi_archive` object, with metadata:
#> ℹ Min/max time values: 2020-01-01 / 2020-01-05
#> ℹ First/last version with update: 2020-01-02 / 2020-01-06
#> ℹ Versions end: 2020-01-06
#> ℹ A preview of the table (10 rows x 4 columns):
#> Key: <geo_value, time_value, version>
#> geo_value time_value version value
#> <char> <Date> <Date> <num>
#> 1: ca 2020-01-01 2020-01-02 0.5999565
#> 2: ca 2020-01-02 2020-01-03 2.2553171
#> 3: ca 2020-01-03 2020-01-04 -0.4372636
#> 4: ca 2020-01-04 2020-01-05 1.9944287
#> 5: ca 2020-01-05 2020-01-06 2.6215527
#> 6: hi 2020-01-01 2020-01-02 3.1484116
#> 7: hi 2020-01-02 2020-01-03 0.1781823
#> 8: hi 2020-01-03 2020-01-04 1.7526747
#> 9: hi 2020-01-04 2020-01-05 1.7558004
#> 10: hi 2020-01-05 2020-01-06 1.7172946
# Ex. with an additional key for county
df <- data.frame(
  geo_value = c(replicate(2, "ca"), replicate(2, "fl")),
  county = c(1, 3, 2, 5),
  time_value = c(
    "2020-06-01",
    "2020-06-02",
    "2020-06-01",
    "2020-06-02"
  ),
  version = c(
    "2020-06-02",
    "2020-06-03",
    "2020-06-02",
    "2020-06-03"
  ),
  cases = c(1, 2, 3, 4),
  cases_rate = c(0.01, 0.02, 0.01, 0.05)
)
x <- df %>% as_epi_archive(other_keys = "county")
#> Warning: Unsupported time type in column `x$time_value`, with class `character`.
#> Time-related functionality may have unexpected behavior.