The {epiprocess}
package works with epidemiological time series and version data to
provide situational awareness, processing and transformations in
preparation for modeling, and version-faithful model backtesting. It
contains:
-
epi_df
, a class for working with epidemiological time series data; -
epi_archive
, a class for working with the version history of such time series data; - sample data in these formats;
-
{dplyr}
-esque “verbs” for common data transformations (e.g., 7-day averages); - functions for exploratory data analysis and situational awareness (e.g., outlier detection and growth rate estimation); and
-
{dplyr}
-esque “verbs” for version-faithful “pseudoprospective” backtesting of models, and other version history analysis and transformations.
It is part of a broader suite of packages that includes {epipredict}
,
{epidatr}
,
{rtestim}
,
and {epidatasets}
,
for accessing, analyzing, and forecasting epidemiological time series
data. We have expanded documentation and demonstrations for some of
these packages available in an online “book” format here.
Motivation
{epiprocess}
and {epipredict}
are designed to lower the barrier to entry and implementation cost for
epidemiological time series analysis and forecasting. Epidemiologists
and forecasting groups repeatedly and separately have had to rush to
implement this type of functionality in a much more ad hoc manner; we
are trying to save such effort in the future by providing
well-documented, tested, and general packages that can be called for
many common tasks instead.
{epiprocess}
also provides tools to help avoid a particularly common pitfall in
analysis and forecasting: ignoring reporting latency and revisions to a
data set. This can, for example, lead to one retrospectively analyzing a
surveillance signal or forecasting model and concluding that it is much
more accurate than it actually was in real time, or producing
always-decreasing forecasts on data sets where initial surveillance
estimates are systematically revised upward. Storing and working with
version history can help avoid these issues.
Intended audience
We expect users to be proficient in R, and familiar with the {dplyr}
and {tidyr}
packages.
Installing
This package is not on CRAN yet, so it can be installed using the {devtools}
package:
devtools::install_github("cmu-delphi/epiprocess", ref = "main")
Building the vignettes, such as this getting started guide, takes a significant amount of time. They are not included in the package by default. If you want to include vignettes, then use this modified command:
devtools::install_github("cmu-delphi/epiprocess",
ref = "main",
build_vignettes = TRUE, dependencies = TRUE
)
Getting data into epi_df
format
We’ll start by showing how to get data into
epi_df
format, which
is just a tibble with a bit of special structure, and is the format
assumed by all of the functions in the epiprocess
package.
An epi_df
object has (at least) the following columns:
-
geo_value
: the geographic value associated with each row of measurements. -
time_value
: the time value associated with each row of measurements.
It can have any number of other columns which can serve as measured variables, which we also broadly refer to as signal variables. The documentation for gives more details about this data format.
A data frame or tibble that has geo_value
and
time_value
columns can be converted into an
epi_df
object, using the function as_epi_df()
.
As an example, we’ll work with daily cumulative COVID-19 cases from four
U.S. states: CA, FL, NY, and TX, over time span from mid 2020 to early
2022, and we’ll use the epidatr
package to fetch this data from the COVIDcast
API.
library(epidatr)
library(epiprocess)
library(dplyr)
library(tidyr)
library(withr)
cases <- pub_covidcast(
source = "jhu-csse",
signals = "confirmed_cumulative_num",
geo_type = "state",
time_type = "day",
geo_values = "ca,fl,ny,tx",
time_values = epirange(20200301, 20220131),
)
colnames(cases)
## [1] "geo_value" "signal" "source"
## [4] "geo_type" "time_type" "time_value"
## [7] "direction" "issue" "lag"
## [10] "missing_value" "missing_stderr" "missing_sample_size"
## [13] "value" "stderr" "sample_size"
As we can see, a data frame returned by
epidatr::pub_covidcast()
has the columns required for an
epi_df
object (along with many others). We can use
as_epi_df()
, with specification of some relevant metadata,
to bring the data frame into epi_df
format.
x <- as_epi_df(cases, as_of = max(cases$issue)) %>%
select(geo_value, time_value, total_cases = value)
class(x)
## [1] "epi_df" "tbl_df" "tbl" "data.frame"
summary(x)
## An `epi_df` x, with metadata:
## * geo_type = state
## * as_of = 2023-03-10
## ----------
## * min time value = 2020-03-01
## * max time value = 2022-01-31
## * average rows per time value = 4
head(x)
## An `epi_df` object, 6 x 3 with metadata:
## * geo_type = state
## * time_type = day
## * as_of = 2023-03-10
##
## # A tibble: 6 × 3
## geo_value time_value total_cases
## * <chr> <date> <dbl>
## 1 ca 2020-03-01 19
## 2 fl 2020-03-01 0
## 3 ny 2020-03-01 0
## 4 tx 2020-03-01 0
## 5 ca 2020-03-02 23
## 6 fl 2020-03-02 1
attributes(x)$metadata
## $geo_type
## [1] "state"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2023-03-10"
##
## $other_keys
## character(0)
Some details on metadata
In general, an epi_df
object has the following fields in
its metadata:
-
geo_type
: the type for the geo values. -
as_of
: the time value at which the given data were available.
Metadata for an epi_df
object x
can be
accessed (and altered) via attributes(x)$metadata
. The
field, geo_type
,is not currently used by any downstream
functions in the epiprocess
package, and serve only as
useful bits of information to convey about the data set at hand. The
last field here, as_of
, is one of the most unique aspects
of an epi_df
object.
In brief, we can think of an epi_df
object as a single
snapshot of a data set that contains the most up-to-date values of some
signals of interest, as of the time specified as_of
. For
example, if as_of
is January 31, 2022, then the
epi_df
object has the most up-to-date version of the data
available as of January 31, 2022. The epiprocess
package
also provides a companion data structure called
epi_archive
, which stores the full version history of a
given data set. See the archive
vignette for more.
If geo_type
or as_of
arguments are missing
in a call to as_epi_df()
, then this function will try to
infer them from the passed object. Usually, geo_type
can be
inferred from the geo_value
columns, respectively, but
inferring the as_of
field is not as easy. See the
documentation for as_epi_df()
more details.
x <- as_epi_df(cases, as_of = as.Date("2024-03-20")) %>%
select(geo_value, time_value, total_cases = value)
attributes(x)$metadata
## $geo_type
## [1] "state"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2024-03-20"
##
## $other_keys
## character(0)
Using additional key columns in epi_df
In the following examples we will show how to create an
epi_df
with additional keys.
Converting a tsibble
that has county code as an extra
key
ex1 <- tibble(
geo_value = rep(c("ca", "fl", "pa"), each = 3),
county_code = c(
"06059", "06061", "06067",
"12111", "12113", "12117",
"42101", "42103", "42105"
),
time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
value = seq_along(geo_value) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(geo_value)))
) %>%
as_tsibble(index = time_value, key = c(geo_value, county_code))
ex1 <- as_epi_df(x = ex1, as_of = "2020-06-03")
The metadata now includes county_code
as an extra
key.
attr(ex1, "metadata")
## $geo_type
## [1] "state"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2020-06-03"
##
## $other_keys
## [1] "county_code"
Dealing with misspecified column names
epi_df
requires there to be columns
geo_value
and time_value
, if they do not exist
then as_epi_df()
throws an error.
data.frame(
# misnamed
state = rep(c("ca", "fl", "pa"), each = 3),
# extra key
pol = rep(c("blue", "swing", "swing"), each = 3),
# misnamed
reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = 9),
value = 1:9 + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, 9))
) %>% as_epi_df(as_of = as.Date("2024-03-20"))
## Error in `guess_column_name()`:
## ! There is no time_value column or similar name. See e.g.
## [`time_column_name()`] for a complete list
The columns can be renamed to match epi_df
format. In
the example below, notice there is also an additional key
pol
.
ex2 <- tibble(
# misnamed
state = rep(c("ca", "fl", "pa"), each = 3),
# extra key
pol = rep(c("blue", "swing", "swing"), each = 3),
# misnamed
reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(state)),
value = seq_along(state) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(state)))
) %>% data.frame()
head(ex2)
## state pol reported_date value
## 1 ca blue 2020-06-01 1.09
## 2 ca blue 2020-06-02 2.09
## 3 ca blue 2020-06-03 3.09
## 4 fl swing 2020-06-01 4.09
## 5 fl swing 2020-06-02 5.09
## 6 fl swing 2020-06-03 6.09
ex2 <- ex2 %>%
rename(geo_value = state, time_value = reported_date) %>%
as_epi_df(
as_of = "2020-06-03",
other_keys = "pol"
)
attr(ex2, "metadata")
## $geo_type
## [1] "state"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2020-06-03"
##
## $other_keys
## [1] "pol"
Adding additional keys to an epi_df
object
In the above examples, all the keys are added to objects that are not
epi_df
objects. We illustrate how to add keys to an
epi_df
object.
We use a toy data set included in epiprocess
prepared
using the covidcast
library and are filtering to a single
state for simplicity.
ex3 <- jhu_csse_county_level_subset %>%
filter(time_value > "2021-12-01", state_name == "Massachusetts") %>%
slice_tail(n = 6)
attr(ex3, "metadata") # geo_type is county currently
## $geo_type
## [1] "county"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2024-08-23 02:40:32 UTC"
##
## $other_keys
## character(0)
Now we add state
(MA) and pol
as new
columns to the data and as new keys to the metadata. Reminder that lower
case state name abbreviations are what we would expect if this were a
geo_value
column.
ex3 <- ex3 %>%
as_tibble() %>% # needed to add the additional metadata
mutate(
state = rep(tolower("MA"), 6),
pol = rep(c("blue", "swing", "swing"), each = 2)
) %>%
as_epi_df(other_keys = c("state", "pol"), as_of = as.Date("2024-03-20"))
attr(ex3, "metadata")
## $geo_type
## [1] "county"
##
## $time_type
## [1] "day"
##
## $as_of
## [1] "2024-03-20"
##
## $other_keys
## [1] "state" "pol"
Note that the two additional keys we added, state
and
pol
, are specified as a character vector in the
other_keys
argument. They must be specified in this manner
so that downstream actions on the epi_df
, like model
fitting and prediction, can recognize and use these keys.
Currently other_keys
metadata in epi_df
doesn’t impact epi_slide()
, contrary to
other_keys
in as_epi_archive
which affects how
the update data is interpreted.
Working with epi_df
objects downstream
Data in epi_df
format should be easy to work with
downstream, since it is a very standard tabular data format; in the
other vignettes, we’ll walk through some basic signal processing tasks
using functions provided in the epiprocess
package. Of
course, we can also write custom code for other downstream uses, like
plotting, which is pretty easy to do ggplot2
.
library(ggplot2)
theme_set(theme_bw())
ggplot(x, aes(x = time_value, y = total_cases, color = geo_value)) +
geom_line() +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
labs(x = "Date", y = "Cumulative COVID-19 cases", color = "State")
As a last couple examples, we’ll look at some more data sets just to
show how we might get them into epi_df
format. Data on
daily new (not cumulative) SARS cases in Canada in 2003, from the outbreaks
package:
x <- outbreaks::sars_canada_2003 %>%
mutate(geo_value = "ca") %>%
select(geo_value, time_value = date, starts_with("cases")) %>%
as_epi_df(as_of = as.Date("2024-03-20"))
head(x)
## An `epi_df` object, 6 x 6 with metadata:
## * geo_type = state
## * time_type = day
## * as_of = 2024-03-20
##
## # A tibble: 6 × 6
## geo_value time_value cases_travel cases_household cases_healthcare cases_other
## * <chr> <date> <int> <int> <int> <int>
## 1 ca 2003-02-23 1 0 0 0
## 2 ca 2003-02-24 0 0 0 0
## 3 ca 2003-02-25 0 0 0 0
## 4 ca 2003-02-26 0 1 0 0
## 5 ca 2003-02-27 0 0 0 0
## 6 ca 2003-02-28 1 0 0 0
library(tidyr)
x <- x %>%
pivot_longer(starts_with("cases"), names_to = "type") %>%
mutate(type = substring(type, 7))
yrange <- range(
x %>%
group_by(time_value) %>%
summarize(value = sum(value)) %>%
pull(value)
)
ggplot(x, aes(x = time_value, y = value)) +
geom_col(aes(fill = type)) +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
scale_y_continuous(breaks = yrange[1]:yrange[2]) +
labs(x = "Date", y = "SARS cases in Canada", fill = "Type")
Get confirmed cases of Ebola in Sierra Leone from 2014 to 2015 by province and date of onset, prepared from line list data from the same package:
x <- outbreaks::ebola_sierraleone_2014 %>%
select(district, date_of_onset, status) %>%
mutate(province = case_when(
district %in% c("Kailahun", "Kenema", "Kono") ~
"Eastern",
district %in% c(
"Bombali", "Kambia", "Koinadugu", "Port Loko",
"Tonkolili"
) ~
"Northern",
district %in% c("Bo", "Bonthe", "Moyamba", "Pujehun") ~
"Sourthern",
district %in% c("Western Rural", "Western Urban") ~
"Western"
)) %>%
group_by(geo_value = province, time_value = date_of_onset) %>%
summarise(cases = sum(status == "confirmed"), .groups = "drop") %>%
complete(geo_value,
time_value = full_seq(time_value, period = 1),
fill = list(cases = 0)
) %>%
as_epi_df(as_of = as.Date("2024-03-20"))
ggplot(x, aes(x = time_value, y = cases)) +
geom_col(aes(fill = geo_value), show.legend = FALSE) +
facet_wrap(~geo_value, scales = "free_y") +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")
Attribution
This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
From the COVIDcast Epidata API: These signals are taken directly from the JHU CSSE COVID-19 GitHub repository without changes.