Skip to contents

The {epiprocess} package works with epidemiological time series and version data to provide situational awareness, processing and transformations in preparation for modeling, and version-faithful model backtesting. It contains:

  • epi_df, a class for working with epidemiological time series data;
  • epi_archive, a class for working with the version history of such time series data;
  • sample data in these formats;
  • {dplyr}-esque “verbs” for common data transformations (e.g., 7-day averages);
  • functions for exploratory data analysis and situational awareness (e.g., outlier detection and growth rate estimation); and
  • {dplyr}-esque “verbs” for version-faithful “pseudoprospective” backtesting of models, and other version history analysis and transformations.

It is part of a broader suite of packages that includes {epipredict}, {epidatr}, {rtestim}, and {epidatasets}, for accessing, analyzing, and forecasting epidemiological time series data. We have expanded documentation and demonstrations for some of these packages available in an online “book” format here.

Motivation

{epiprocess} and {epipredict} are designed to lower the barrier to entry and implementation cost for epidemiological time series analysis and forecasting. Epidemiologists and forecasting groups repeatedly and separately have had to rush to implement this type of functionality in a much more ad hoc manner; we are trying to save such effort in the future by providing well-documented, tested, and general packages that can be called for many common tasks instead.

{epiprocess} also provides tools to help avoid a particularly common pitfall in analysis and forecasting: ignoring reporting latency and revisions to a data set. This can, for example, lead to one retrospectively analyzing a surveillance signal or forecasting model and concluding that it is much more accurate than it actually was in real time, or producing always-decreasing forecasts on data sets where initial surveillance estimates are systematically revised upward. Storing and working with version history can help avoid these issues.

Intended audience

We expect users to be proficient in R, and familiar with the {dplyr} and {tidyr} packages.

Installing

This package is not on CRAN yet, so it can be installed using the {devtools} package:

devtools::install_github("cmu-delphi/epiprocess", ref = "main")

Building the vignettes, such as this getting started guide, takes a significant amount of time. They are not included in the package by default. If you want to include vignettes, then use this modified command:

devtools::install_github("cmu-delphi/epiprocess",
  ref = "main",
  build_vignettes = TRUE, dependencies = TRUE
)

Getting data into epi_df format

We’ll start by showing how to get data into epi_df format, which is just a tibble with a bit of special structure, and is the format assumed by all of the functions in the epiprocess package. An epi_df object has (at least) the following columns:

  • geo_value: the geographic value associated with each row of measurements.
  • time_value: the time value associated with each row of measurements.

It can have any number of other columns which can serve as measured variables, which we also broadly refer to as signal variables. The documentation for gives more details about this data format.

A data frame or tibble that has geo_value and time_value columns can be converted into an epi_df object, using the function as_epi_df(). As an example, we’ll work with daily cumulative COVID-19 cases from four U.S. states: CA, FL, NY, and TX, over time span from mid 2020 to early 2022, and we’ll use the epidatr package to fetch this data from the COVIDcast API.

library(epidatr)
library(epiprocess)
library(dplyr)
library(tidyr)
library(withr)

cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_cumulative_num",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = epirange(20200301, 20220131),
)

colnames(cases)
##  [1] "geo_value"           "signal"              "source"             
##  [4] "geo_type"            "time_type"           "time_value"         
##  [7] "direction"           "issue"               "lag"                
## [10] "missing_value"       "missing_stderr"      "missing_sample_size"
## [13] "value"               "stderr"              "sample_size"

As we can see, a data frame returned by epidatr::pub_covidcast() has the columns required for an epi_df object (along with many others). We can use as_epi_df(), with specification of some relevant metadata, to bring the data frame into epi_df format.

x <- as_epi_df(cases, as_of = max(cases$issue)) %>%
  select(geo_value, time_value, total_cases = value)

class(x)
## [1] "epi_df"     "tbl_df"     "tbl"        "data.frame"
## An `epi_df` x, with metadata:
## * geo_type  = state
## * as_of     = 2023-03-10
## ----------
## * min time value              = 2020-03-01
## * max time value              = 2022-01-31
## * average rows per time value = 4
head(x)
## An `epi_df` object, 6 x 3 with metadata:
## * geo_type  = state
## * time_type = day
## * as_of     = 2023-03-10
## 
## # A tibble: 6 × 3
##   geo_value time_value total_cases
## * <chr>     <date>           <dbl>
## 1 ca        2020-03-01          19
## 2 fl        2020-03-01           0
## 3 ny        2020-03-01           0
## 4 tx        2020-03-01           0
## 5 ca        2020-03-02          23
## 6 fl        2020-03-02           1
attributes(x)$metadata
## $geo_type
## [1] "state"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2023-03-10"
## 
## $other_keys
## character(0)

Some details on metadata

In general, an epi_df object has the following fields in its metadata:

  • geo_type: the type for the geo values.
  • as_of: the time value at which the given data were available.

Metadata for an epi_df object x can be accessed (and altered) via attributes(x)$metadata. The field, geo_type,is not currently used by any downstream functions in the epiprocess package, and serve only as useful bits of information to convey about the data set at hand. The last field here, as_of, is one of the most unique aspects of an epi_df object.

In brief, we can think of an epi_df object as a single snapshot of a data set that contains the most up-to-date values of some signals of interest, as of the time specified as_of. For example, if as_of is January 31, 2022, then the epi_df object has the most up-to-date version of the data available as of January 31, 2022. The epiprocess package also provides a companion data structure called epi_archive, which stores the full version history of a given data set. See the archive vignette for more.

If geo_type or as_of arguments are missing in a call to as_epi_df(), then this function will try to infer them from the passed object. Usually, geo_type can be inferred from the geo_value columns, respectively, but inferring the as_of field is not as easy. See the documentation for as_epi_df() more details.

x <- as_epi_df(cases, as_of = as.Date("2024-03-20")) %>%
  select(geo_value, time_value, total_cases = value)

attributes(x)$metadata
## $geo_type
## [1] "state"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2024-03-20"
## 
## $other_keys
## character(0)

Using additional key columns in epi_df

In the following examples we will show how to create an epi_df with additional keys.

Converting a tsibble that has county code as an extra key

ex1 <- tibble(
  geo_value = rep(c("ca", "fl", "pa"), each = 3),
  county_code = c(
    "06059", "06061", "06067",
    "12111", "12113", "12117",
    "42101", "42103", "42105"
  ),
  time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
  value = seq_along(geo_value) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(geo_value)))
) %>%
  as_tsibble(index = time_value, key = c(geo_value, county_code))

ex1 <- as_epi_df(x = ex1, as_of = "2020-06-03")

The metadata now includes county_code as an extra key.

attr(ex1, "metadata")
## $geo_type
## [1] "state"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2020-06-03"
## 
## $other_keys
## [1] "county_code"

Dealing with misspecified column names

epi_df requires there to be columns geo_value and time_value, if they do not exist then as_epi_df() throws an error.

data.frame(
  # misnamed
  state = rep(c("ca", "fl", "pa"), each = 3),
  # extra key
  pol = rep(c("blue", "swing", "swing"), each = 3),
  # misnamed
  reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = 9),
  value = 1:9 + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, 9))
) %>% as_epi_df(as_of = as.Date("2024-03-20"))
## Error in `guess_column_name()`:
## ! There is no time_value column or similar name. See e.g.
##   [`time_column_name()`] for a complete list

The columns can be renamed to match epi_df format. In the example below, notice there is also an additional key pol.

ex2 <- tibble(
  # misnamed
  state = rep(c("ca", "fl", "pa"), each = 3),
  # extra key
  pol = rep(c("blue", "swing", "swing"), each = 3),
  # misnamed
  reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(state)),
  value = seq_along(state) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(state)))
) %>% data.frame()

head(ex2)
##   state   pol reported_date value
## 1    ca  blue    2020-06-01  1.09
## 2    ca  blue    2020-06-02  2.09
## 3    ca  blue    2020-06-03  3.09
## 4    fl swing    2020-06-01  4.09
## 5    fl swing    2020-06-02  5.09
## 6    fl swing    2020-06-03  6.09
ex2 <- ex2 %>%
  rename(geo_value = state, time_value = reported_date) %>%
  as_epi_df(
    as_of = "2020-06-03",
    other_keys = "pol"
  )

attr(ex2, "metadata")
## $geo_type
## [1] "state"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2020-06-03"
## 
## $other_keys
## [1] "pol"

Adding additional keys to an epi_df object

In the above examples, all the keys are added to objects that are not epi_df objects. We illustrate how to add keys to an epi_df object.

We use a toy data set included in epiprocess prepared using the covidcast library and are filtering to a single state for simplicity.

ex3 <- jhu_csse_county_level_subset %>%
  filter(time_value > "2021-12-01", state_name == "Massachusetts") %>%
  slice_tail(n = 6)

attr(ex3, "metadata") # geo_type is county currently
## $geo_type
## [1] "county"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2024-08-23 02:40:32 UTC"
## 
## $other_keys
## character(0)

Now we add state (MA) and pol as new columns to the data and as new keys to the metadata. Reminder that lower case state name abbreviations are what we would expect if this were a geo_value column.

ex3 <- ex3 %>%
  as_tibble() %>% # needed to add the additional metadata
  mutate(
    state = rep(tolower("MA"), 6),
    pol = rep(c("blue", "swing", "swing"), each = 2)
  ) %>%
  as_epi_df(other_keys = c("state", "pol"), as_of = as.Date("2024-03-20"))

attr(ex3, "metadata")
## $geo_type
## [1] "county"
## 
## $time_type
## [1] "day"
## 
## $as_of
## [1] "2024-03-20"
## 
## $other_keys
## [1] "state" "pol"

Note that the two additional keys we added, state and pol, are specified as a character vector in the other_keys argument. They must be specified in this manner so that downstream actions on the epi_df, like model fitting and prediction, can recognize and use these keys.

Currently other_keys metadata in epi_df doesn’t impact epi_slide(), contrary to other_keys in as_epi_archive which affects how the update data is interpreted.

Working with epi_df objects downstream

Data in epi_df format should be easy to work with downstream, since it is a very standard tabular data format; in the other vignettes, we’ll walk through some basic signal processing tasks using functions provided in the epiprocess package. Of course, we can also write custom code for other downstream uses, like plotting, which is pretty easy to do ggplot2.

library(ggplot2)
theme_set(theme_bw())

ggplot(x, aes(x = time_value, y = total_cases, color = geo_value)) +
  geom_line() +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  labs(x = "Date", y = "Cumulative COVID-19 cases", color = "State")

As a last couple examples, we’ll look at some more data sets just to show how we might get them into epi_df format. Data on daily new (not cumulative) SARS cases in Canada in 2003, from the outbreaks package:

x <- outbreaks::sars_canada_2003 %>%
  mutate(geo_value = "ca") %>%
  select(geo_value, time_value = date, starts_with("cases")) %>%
  as_epi_df(as_of = as.Date("2024-03-20"))

head(x)
## An `epi_df` object, 6 x 6 with metadata:
## * geo_type  = state
## * time_type = day
## * as_of     = 2024-03-20
## 
## # A tibble: 6 × 6
##   geo_value time_value cases_travel cases_household cases_healthcare cases_other
## * <chr>     <date>            <int>           <int>            <int>       <int>
## 1 ca        2003-02-23            1               0                0           0
## 2 ca        2003-02-24            0               0                0           0
## 3 ca        2003-02-25            0               0                0           0
## 4 ca        2003-02-26            0               1                0           0
## 5 ca        2003-02-27            0               0                0           0
## 6 ca        2003-02-28            1               0                0           0
library(tidyr)
x <- x %>%
  pivot_longer(starts_with("cases"), names_to = "type") %>%
  mutate(type = substring(type, 7))

yrange <- range(
  x %>%
    group_by(time_value) %>%
    summarize(value = sum(value)) %>%
    pull(value)
)

ggplot(x, aes(x = time_value, y = value)) +
  geom_col(aes(fill = type)) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  scale_y_continuous(breaks = yrange[1]:yrange[2]) +
  labs(x = "Date", y = "SARS cases in Canada", fill = "Type")

Get confirmed cases of Ebola in Sierra Leone from 2014 to 2015 by province and date of onset, prepared from line list data from the same package:

x <- outbreaks::ebola_sierraleone_2014 %>%
  select(district, date_of_onset, status) %>%
  mutate(province = case_when(
    district %in% c("Kailahun", "Kenema", "Kono") ~
      "Eastern",
    district %in% c(
      "Bombali", "Kambia", "Koinadugu", "Port Loko",
      "Tonkolili"
    ) ~
      "Northern",
    district %in% c("Bo", "Bonthe", "Moyamba", "Pujehun") ~
      "Sourthern",
    district %in% c("Western Rural", "Western Urban") ~
      "Western"
  )) %>%
  group_by(geo_value = province, time_value = date_of_onset) %>%
  summarise(cases = sum(status == "confirmed"), .groups = "drop") %>%
  complete(geo_value,
    time_value = full_seq(time_value, period = 1),
    fill = list(cases = 0)
  ) %>%
  as_epi_df(as_of = as.Date("2024-03-20"))

ggplot(x, aes(x = time_value, y = cases)) +
  geom_col(aes(fill = geo_value), show.legend = FALSE) +
  facet_wrap(~geo_value, scales = "free_y") +
  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
  labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")

Attribution

This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.

From the COVIDcast Epidata API: These signals are taken directly from the JHU CSSE COVID-19 GitHub repository without changes.