Finding, fetching, and processing epidemiological data with {epidatr} and {epiprocess}


Delphi Research Group at CMU

Slides: Nat DeFries, Dmitry Shemetov, Logan Brooks, others on Delphi tooling team

CDC and MIDAS Forecasting Meeting — 21 November 2023

Slides are online at https://cmu-delphi.github.io/midas-cdc-2023-demo

The Delphi {epidatr} package is a new R front-end for the Delphi Epidata API

  • streamlines downloading and usage of data from the Delphi Epidata API
    • real-time access to epidemiological surveillance data for influenza, COVID-19, and other diseases
    • data from both official government sources such as the CDC and from private partners
    • a historical record of all data available, including corrections and updates, which is useful for backtesting of forecasting models.
  • provides a simple R interface to the API, with functions for downloading data, parsing results, and converting to tidy format.
    • the {epidatr} package is a complete rewrite of the {covidcast} package and delphi_epidata.R script, with a focus on speed, reliability, and ease of use
    • the {covidcast} package and delphi_epidata.R script are deprecated and will no longer be updated

Install in the usual ways

  • You can install the stable version of this package from CRAN, using any one of:
install.packages("epidatr")
pak::pkg_install("epidatr")
renv::install("epidatr")
  • Or, if you want the development version, install from GitHub using any one of:
pak::pkg_install("cmu-delphi/epidatr@dev")
remotes::install_github("cmu-delphi/epidatr", ref = "dev")
renv::install("cmu-delphi/epidatr@dev")
  • {epidatr} requires a (free) API key for full functionality
    • To generate your key, register for a pseudo-anonymous account (see the general API website for details) and use save_api_key() for help storing the key.
    • Note: private endpoints (those prefixed with pvt_) require a separate key, passed as an argument, and are available only under a data use agreement.
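
A minimal sketch of making a key available in a scripted or non-interactive session. It assumes {epidatr} looks the key up in the DELPHI_EPIDATA_KEY environment variable (our understanding of the package's lookup; confirm against its documentation), and the key string is a placeholder:

```r
# Placeholder value; substitute the key you received after registering.
Sys.setenv(DELPHI_EPIDATA_KEY = "your-key-here")

# Confirm the variable is visible to the current R session.
key_is_set <- nzchar(Sys.getenv("DELPHI_EPIDATA_KEY"))
key_is_set
```

For everyday use, putting the same assignment in your .Renviron file (which save_api_key() helps with) keeps the key out of scripts.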

Example: HHS/NHSN hospitalization data

  • Fetch national COVID-19 hospital admissions:
library(epidatr)
epidata <- pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  geo_type = "nation",
  time_type = "day",
  geo_values = "us",
  time_values = epirange("2023-01-01", "2023-06-01")
  # (by default, fetches the current version)
)
# `epidata` looks like:
# A tibble: 152 × 15
  signal   source geo_type time_type geo_value time_value issue        lag value
  <chr>    <chr>  <fct>    <fct>     <chr>     <date>     <date>     <int> <dbl>
1 confirm… hhs    nation   day       us        2023-01-01 2023-10-03   275  6078
2 confirm… hhs    nation   day       us        2023-01-02 2023-11-17   319  6727
3 confirm… hhs    nation   day       us        2023-01-03 2023-11-17   318  6932
4 confirm… hhs    nation   day       us        2023-01-04 2023-11-11   311  6693
5 confirm… hhs    nation   day       us        2023-01-05 2023-11-17   316  6609
# ℹ 147 more rows
# ℹ 6 more variables: direction <dbl>, missing_value <int>,
#   missing_stderr <int>, missing_sample_size <int>, stderr <dbl>,
#   sample_size <dbl>

Example: versioned HHS/NHSN hospitalization data

  • Fetch what the same query would have returned back in June (“as of” June 1st):
epidata <- pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  geo_type = "nation",
  time_type = "day",
  geo_values = "us",
  time_values = epirange("2023-01-01", "2023-06-01"),
  as_of = "2023-06-01"
)
# `epidata` looks like:
# A tibble: 150 × 15
  signal   source geo_type time_type geo_value time_value issue        lag value
  <chr>    <chr>  <fct>    <fct>     <chr>     <date>     <date>     <int> <dbl>
1 confirm… hhs    nation   day       us        2023-01-01 2023-05-19   138  6058
2 confirm… hhs    nation   day       us        2023-01-02 2023-06-01   150  6713
3 confirm… hhs    nation   day       us        2023-01-03 2023-06-01   149  6893
4 confirm… hhs    nation   day       us        2023-01-04 2023-06-01   148  6657
5 confirm… hhs    nation   day       us        2023-01-05 2023-06-01   147  6587
# ℹ 145 more rows
# ℹ 6 more variables: direction <dbl>, missing_value <int>,
#   missing_stderr <int>, missing_sample_size <int>, stderr <dbl>,
#   sample_size <dbl>
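
To make the "as of" semantics concrete, here is a small local sketch that does not call the API. The revision history is made up, and snapshot_as_of() is a hypothetical helper written for this illustration, not an {epidatr} function. For each time_value it keeps the most recent issue published on or before the requested date:

```r
# Toy revision history: each row is the value for a reference date
# (`time_value`) as published on a given `issue` date. Numbers are made up.
history <- data.frame(
  time_value = as.Date(c("2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02")),
  issue      = as.Date(c("2023-05-19", "2023-10-03", "2023-06-01", "2023-11-17")),
  value      = c(100, 110, 200, 190)
)

# Hypothetical helper: for each time_value, keep the latest issue <= as_of.
snapshot_as_of <- function(history, as_of) {
  visible <- history[history$issue <= as.Date(as_of), ]
  visible <- visible[order(visible$time_value, visible$issue), ]
  # After sorting, the last row per time_value is its latest visible issue.
  out <- visible[!duplicated(visible$time_value, fromLast = TRUE), ]
  rownames(out) <- NULL
  out
}

june_view  <- snapshot_as_of(history, "2023-06-01")
later_view <- snapshot_as_of(history, "2023-11-21")
```

Here june_view contains only the issues from May 19 and June 1, while later_view picks up the revised values issued in October and November.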

Access other useful data, including Delphi-exclusive sources

See also covidcast_epidata() or the COVIDcast web site for a listing of other COVIDcast data available.

Access more than just COVID data!

Using avail_endpoints() you can find a listing of our other endpoints that serve a wide variety of public health data. Here we’ve filtered to non-COVID-specific data.

   Endpoint               Description                                
 1 pub_delphi()           Delphi's ILINet forecasts                  
 2 pub_dengue_nowcast()   Delphi's PAHO Dengue nowcast               
 3 pub_ecdc_ili()         ECDC ILI data                              
 4 pub_flusurv()          FluSurv hospitalization data               
 5 pub_fluview()          FluView ILINet data                        
 6 pub_fluview_clinical() FluView virological data from clinical labs
 7 pub_fluview_meta()     FluView metadata                           
 8 pub_gft()              Google Flu Trends data                     
 9 pub_kcdc_ili()         KCDC ILI data                              
10 pub_meta()             API metadata                               
11 pub_nidss_dengue()     NIDSS dengue data                          
12 pub_nidss_flu()        NIDSS flu data                             
13 pub_nowcast()          Delphi's ILI nowcast                       
14 pub_paho_dengue()      PAHO Dengue data                           
15 pub_wiki()             Wikipedia access data                      

Consider subscribing to the Delphi API mailing list to be notified of package updates, new data sources, corrections, and other changes.

The {epiprocess} package helps work with epidemic datasets

pak::pkg_install("cmu-delphi/epiprocess@main")
  • provides common data structures for epidemiological data sets measured over space and time
  • provides utilities for basic signal processing tasks

epi_df: a snapshot of epidata in time

  • represents the most up-to-date values of a dataset as of a given time
  • a subclassed tibble with two required columns: geo_value and time_value
  • and associated metadata: geo_type, time_type, other_keys, as_of
  • can have any number of other columns, which we call signal (or measured) variables

Produce an epi_df from epidatr output like so:

tbl <- pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = "*"
)
library(epiprocess)
library(dplyr) # provides %>%
epi_df <- tbl %>%
  dplyr::select(geo_value, time_value, admissions = value) %>%
  # Add NAs to fill gaps, cover same time range for each geo:
  tidyr::complete(geo_value, time_value = tidyr::full_seq(time_value, period = 1L)) %>%
  as_epi_df(
    geo_type = "state",
    time_type = "day",
    as_of = max(tbl$issue)
  )

epi_df
An `epi_df` object, 5,644 x 3 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2023-11-19

# A tibble: 5,644 × 3
   geo_value time_value admissions
 * <chr>     <date>          <dbl>
 1 ca        2019-12-31         NA
 2 ca        2020-01-01         NA
 3 ca        2020-01-02         NA
 4 ca        2020-01-03         NA
 5 ca        2020-01-04         NA
 6 ca        2020-01-05         NA
 7 ca        2020-01-06         NA
 8 ca        2020-01-07         NA
 9 ca        2020-01-08         NA
10 ca        2020-01-09         NA
# ℹ 5,634 more rows

epi_archive: a collection of historical epidata

  • represents the most up-to-date values of a dataset as of various given times
  • required input columns: geo_value, time_value, version
  • can have any number of other key or signal/measured columns

tbl <- pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_covid_1d",
  geo_type = "state",
  time_type = "day",
  geo_values = "ca,fl,ny,tx",
  time_values = "*", # "*" = all time values
  issues = epirange("1234-01-01", "2023-06-01") # start of range must be before data set start
)
epi_archive <- tbl %>%
  dplyr::select(
    geo_value, time_value,
    version = issue, admissions = value
  ) %>%
  # don't try to `complete` here; `complete` after `epix_as_of` or inside `epix_slide` computations
  as_epi_archive(compactify = TRUE)

epi_archive
An `epi_archive` object, with metadata:
* geo_type  = state
* time_type = day
----------
* min time value = 2019-12-31
* max time value = 2023-05-30
* first version with update = 2020-11-16
* last version with update = 2023-06-01
* No clobberable versions
* versions end   = 2023-06-01
----------
Data archive (stored in DT field): 24625 x 4
Columns in DT: geo_value, time_value, version, admissions
----------
Public R6 methods: initialize, print, as_of, fill_through_version, 
                   truncate_versions_after, merge, group_by, slide, clone

epi_archive$DT
       geo_value time_value    version admissions
    1:        ca 2020-02-03 2020-11-16         NA
    2:        ca 2020-02-04 2020-11-16         NA
    3:        ca 2020-02-05 2020-11-16         NA
    4:        ca 2020-02-06 2020-11-16         NA
    5:        ca 2020-02-07 2020-11-16         NA
   ---                                           
24621:        tx 2023-05-28 2023-05-31         98
24622:        tx 2023-05-28 2023-06-01         85
24623:        tx 2023-05-29 2023-05-31         19
24624:        tx 2023-05-29 2023-06-01         96
24625:        tx 2023-05-30 2023-06-01         96

Some epi_slide use cases

  • Calculate rolling or running averages, sums, other statistics
  • Calculate custom growth rates, categorical trend definitions, smoothing (see also epiprocess::growth_rate, epipredict::step_lag_difference)
  • (Perform latency- and revision-naive forecaster backtesting)
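
As an illustration of the first use case, the sketch below computes a per-geo trailing 7-day average in base R. It is not epi_slide() itself, only the kind of grouped time-window computation it is designed for. The data are made up, trailing_mean() is a hypothetical helper, and the row-based window assumes each geo has one row per day with no gaps (for example after a tidyr::complete() step like the one used when building the epi_df):

```r
# Toy data: two geos, ten consecutive days each. Values are made up.
toy <- data.frame(
  geo_value  = rep(c("ca", "tx"), each = 10),
  time_value = rep(seq(as.Date("2023-01-01"), by = "day", length.out = 10), times = 2),
  admissions = c(1:10, seq(10, 100, by = 10))
)

# Hypothetical helper: mean of the current value and up to `before` previous
# values. Windows at the start of a series are shorter, not padded with NA.
trailing_mean <- function(x, before = 6) {
  vapply(seq_along(x), function(i) {
    mean(x[max(1, i - before):i], na.rm = TRUE)
  }, numeric(1))
}

# Sort by geo and date, then apply the window within each geo separately.
toy <- toy[order(toy$geo_value, toy$time_value), ]
toy$admissions_7dav <- ave(toy$admissions, toy$geo_value, FUN = trailing_mean)
```

Because the window is applied within each geo, the first tx value is averaged only over tx rows and never mixes in the end of the ca series.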

Some epix_as_of, epix_slide use cases

  • Better backtesting: generate pseudoprospective forecasts
  • Plot past forecast against data available at generation time
  • Plot evolution of how a time series was reported
  • Analyze reporting latency, revision behavior, trends
  • Improve forecasts when revisions are significant: prepare “version-analogous” training set predictor data
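
A rough base-R sketch of the latency and revision analysis mentioned above, using the five tx rows from the tail of the archive printout (geo_value, time_value, version, admissions). It does not use epix_slide(), and the summary column names are chosen for this illustration. Because the archive was built with compactify = TRUE, each stored version is one in which the value changed:

```r
# The tx rows shown at the end of the printed archive DT.
versions <- data.frame(
  time_value = as.Date(c("2023-05-28", "2023-05-28", "2023-05-29",
                         "2023-05-29", "2023-05-30")),
  version    = as.Date(c("2023-05-31", "2023-06-01", "2023-05-31",
                         "2023-06-01", "2023-06-01")),
  admissions = c(98, 85, 19, 96, 96)
)

# Sort so the first and last row per time_value are its first and latest version.
versions   <- versions[order(versions$time_value, versions$version), ]
first_rows <- versions[!duplicated(versions$time_value), ]
last_rows  <- versions[!duplicated(versions$time_value, fromLast = TRUE), ]

revisions <- data.frame(
  time_value   = first_rows$time_value,
  # Days between the reference date and the first version reporting it.
  latency_days = as.integer(first_rows$version - first_rows$time_value),
  first_value  = first_rows$admissions,
  latest_value = last_rows$admissions,
  # Number of stored versions (i.e. value changes) per reference date.
  n_versions   = as.integer(table(versions$time_value))
)
revisions$revision <- revisions$latest_value - revisions$first_value
```

In this small excerpt the May 29 value moves from 19 to 96 between consecutive versions, the kind of large early revision that a revision-naive backtest would never see.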

epi_df and epi_archive utilities

  • epi_df
    • group_by() - standard grouped operations
    • epi_slide() - perform (grouped) time-window computations on an epi_df
    • epi_cor() - compute correlations between variables in an epi_df
  • epi_archive
    • epix_merge() - merge/join two epi_archive objects
    • epix_as_of() - generate a snapshot epi_df from an epi_archive object
    • group_by() - standard grouped operations
    • epix_slide() - perform (grouped) time-windowed computations on several versions
  • And more, including outlier detection and growth rate calculation.

Resources