Introduction to Panel Data in Epidemiology

InsightNet Forecasting Workshop 2024


Alice Cima, Rachel Lobay, Daniel McDonald, Ryan Tibshirani

with huge thanks to Logan Brooks, Xueda Shen, and also to Nat DeFries, Dmitry Shemetov, and David Weber

11 December – Morning

Outline

  1. The Delphi Research Group

  2. Workshop Overview and System Setup

  3. Panel Data

  4. Versioned Data

  5. Epidata Repository and API

  6. Find Data Sources and Signals

  7. {epidatr}

  8. Versioning in {epidatr}

1 The Delphi Research Group

About Delphi

  • Founded in 2012 at Carnegie Mellon University, now expanded to UC Berkeley and the University of British Columbia.

  • Currently 5 faculty, ~10 PhD students, ~15 staff (mostly software engineers).

  • Easy to join us from anywhere (lots of volunteers during Covid-19 pandemic).

  • We are:

    • CDC Center of Excellence for Influenza and Covid-19 Forecasting (2019-24).
    • CDC Innovation Center for Outbreak Analytics and Disease Modeling (2024-29).

Our mission: To develop the theory and practice of epidemic detection, tracking and forecasting, and their use in decision making, both public and private.

What does Delphi do?

  • Procure real-time, aggregated data streams informative of infectious diseases and syndromes, in collaboration with partners in industry and government.

  • Extract signals and make them widely available via the Epidata platform & API.

  • Develop and deploy algorithms for epidemic detection, tracking, forecasting.

  • Develop and maintain statistical software packages for these tasks.

  • Make it all production-grade, maximally accessible, and open-source (to serve CDC, state and local public health agencies, epi-forecasting researchers, data journalists, the public).

What we provide

2 Workshop Overview and System Setup

What we will cover

  • Characteristics of panel data in epidemiology
  • Tools for processing and plotting panel data
  • Statistical background on nowcasting and forecasting
  • Tools for building nowcasting and forecasting models
  • Plenty of examples throughout of real case studies

Goals part I

  • Expose you to a statistical way of thinking about now/forecasting
  • Certain basic mindsets (e.g., the importance of empirical validation using techniques like time series cross-validation) are ubiquitous
  • Certain basic modeling considerations (e.g., starting simple and building up complexity, taming variance through regularization, addressing nonstationarity with trailing training windows) are also ubiquitous

Goals part II

  • Expose you to software packages which aid processing, tracking, nowcasting, and forecasting with panel data
  • These tools are still in development and we welcome your feedback
  • We have tried hard to get the framework right; but many individual pieces themselves could still be improved
  • If these aren’t working for you, then we want to hear from you!
  • We welcome collaboration, and everything we do is open source

A disclaimer

  • Our backgrounds are primarily in statistics and computer science
  • This obviously influences our way of thinking and our approach to nowcasting and forecasting
  • We don’t have nearly as much experience with traditional epi models but we do have opinions about the pros/cons. Ask us at any point if you have a question about why we’re doing things a certain way

One last slide

  • This workshop is supposed to be useful for YOU. Ask questions if you have them, don’t be shy
  • We may not (likely won’t?) cover everything. Hopefully the materials will be a resource for you beyond this workshop

System setup – Passive viewing


All of the slides are at


https://cmu-delphi.github.io/insightnet-workshop-2024


The source code is in the Repo


https://github.com/cmu-delphi/insightnet-workshop-2024


This is enough, but we hope you’ll want to work through the code as we go along.


Detailed versions of the next few slides are shown at the Repo Link above.

System setup – Required software


We assume you already have


  1. R


  2. An IDE. We’ll use RStudio, but you can use VSCode or Emacs or Whatnot

System setup – Downloading the materials

Easy way:

  1. Click the Big Green Button that says < > Code ▾
  2. Choose Download Zip
  3. Open the Zip directory and then Open insightnet-workshop-2024.Rproj

More expert (local git user):

  1. Click the Big Green Button that says < > Code ▾
  2. Copy the URL.
  3. Open RStudio, select File > New Project > Version Control. Paste there and proceed.

Even more expert (wants github remote):

  1. Click the Grey Button that says ⑂ Fork ▾
  2. Proceed along the same lines as above.

System setup – Installing required packages

We will use a lot of packages.

We’ve tried to make it so you can get them all at once (with the right versions)

🤞 We hope this works… 🤞 Note that you can “Copy to Clipboard”


In RStudio:

install.packages("pak") # good for installing from non-CRAN sources
pak::pkg_install("cmu-delphi/InsightNetFcast24", dependencies = TRUE)
InsightNetFcast24::verify_setup()


Hopefully, you see:

✔ You should be good to go!

Ask for help if you see something like:

Error in `verify_setup()`:
! The following packages do not have the correct version:
ℹ Installed: epipredict 0.2.0.
ℹ Required: epipredict == 0.1.5.

3 Panel Data

Panel data

  • Panel data, or longitudinal data, contain cross-sectional measurements of subjects over time.

  • Since we’re working with aggregated data, the subjects are geographic units (e.g. counties, states).

  • In table form, panel data has a time index, one or more location/key columns, and the measured values.

  • Ex: The % of outpatient doctor visits that are COVID-related in CA from June 2020 to Dec. 2021 (docs):

# A tibble: 549 × 3
   time_value geo_value percent_cli
   <date>     <chr>           <dbl>
 1 2020-06-01 ca               2.75
 2 2020-06-02 ca               2.57
 3 2020-06-03 ca               2.48
 4 2020-06-04 ca               2.41
 5 2020-06-05 ca               2.57
 6 2020-06-06 ca               2.63
 7 2020-06-07 ca               2.73
 8 2020-06-08 ca               3.04
 9 2020-06-09 ca               2.97
10 2020-06-10 ca               2.99
# ℹ 539 more rows
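
A minimal sketch of the same structure in base R, assuming only the first five rows printed above. It checks that each (geo_value, time_value) pair appears exactly once, which is what distinguishes plain panel data from the versioned data discussed later. The object names `panel`, `keys` and `n_duplicated_keys` are illustrative.

```r
# First five rows of the example above, typed in by hand
panel <- data.frame(
  time_value = as.Date("2020-06-01") + 0:4,
  geo_value = "ca",
  percent_cli = c(2.75, 2.57, 2.48, 2.41, 2.57)
)

# In (unversioned) panel data, each location/time pair appears exactly once
keys <- panel[, c("geo_value", "time_value")]
n_duplicated_keys <- sum(duplicated(keys))
n_duplicated_keys
```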

Examples of panel data - COVID-19 cases

JHU CSSE COVID cases per 100k estimates the daily number of new confirmed COVID-19 cases per 100,000 population, averaged over the past 7 days.

Examples of panel data - HHS Admissions

Confirmed COVID-19 Hospital Admissions per 100k estimates the daily sum of adult and pediatric confirmed COVID-19 hospital admissions, per 100,000 population, averaged over the past 7 days.
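
Both signals above are rates per 100,000 population, averaged over the past 7 days. The sketch below shows one way such a transformation could be computed in base R. The counts and the population are made up for illustration, and this is not claimed to be the pipeline that produces the published signals.

```r
# Hypothetical daily counts and population, for illustration only
daily_count <- c(10, 12, 9, 14, 20, 18, 15, 17, 21, 19)
population <- 500000

# Trailing 7-day mean: each day averages itself and the six days before it.
# The first six days have no full window, so they come out as NA.
avg_7d <- as.numeric(stats::filter(daily_count, rep(1 / 7, 7), sides = 1))

# Rescale to a rate per 100,000 population
rate_per_100k <- avg_7d / population * 1e5
rate_per_100k
```

Calling `stats::filter()` with its namespace avoids a clash with `dplyr::filter()` when both packages are loaded.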

4 Versioned Data

Intro to versioned data

  • Many epidemic aggregates are subject to reporting delays and revisions

  • This is because individual-level data has delayed availability:

Person comes to ER → Admitted → Has some tests → Tests come back → Entered into the system → …

  • So, a “Hospital admission” may not be attributable to a particular condition until a few days have passed (the patient may even have been released)

  • Aggregated data have a longer pipeline from the incident to the report.

  • So we have to track both: when the event occurred and when it was reported

  • Additionally, various mistakes lead to revisions

  • This means there can be many different values for the same date

Versioned data

  • The event time is indicated by time_value (aka reference_date)

  • Now, we add a second time index to indicate the data version (aka reporting_date)

  • version = the time at which we saw a particular value associated to a time_value

# A tibble: 6 × 4
  time_value geo_value percent_cli version   
  <date>     <chr>           <dbl> <date>    
1 2020-06-01 ca               2.14 2020-06-06
2 2020-06-01 ca               2.14 2020-06-08
3 2020-06-01 ca               2.11 2020-06-09
4 2020-06-01 ca               2.13 2020-06-10
5 2020-06-01 ca               2.20 2020-06-11
6 2020-06-01 ca               2.23 2020-06-12
  • Note that this feature can be indicated in different ways (ex. version, issue, release, as_of).
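
Given a versioned table like the one above, a common question is "what did the data look like on a given day?". The sketch below answers it in base R, using the six rows shown. `snapshot_as_of()` is a hypothetical helper written for this example, not a function from any Delphi package.

```r
# The six versions of the 2020-06-01 value for CA, copied from the table above
versioned <- data.frame(
  time_value = as.Date("2020-06-01"),
  geo_value = "ca",
  percent_cli = c(2.14, 2.14, 2.11, 2.13, 2.20, 2.23),
  version = as.Date(c("2020-06-06", "2020-06-08", "2020-06-09",
                      "2020-06-10", "2020-06-11", "2020-06-12"))
)

# Hypothetical helper: the data as it would have looked on `as_of`
snapshot_as_of <- function(df, as_of) {
  # Drop every version published after the requested date
  seen <- df[df$version <= as.Date(as_of), , drop = FALSE]
  # Sort so the newest version of each location/date pair comes first
  newest_first <- order(seen$geo_value, seen$time_value, -as.numeric(seen$version))
  seen <- seen[newest_first, , drop = FALSE]
  # Keep the first (newest) row per location/date pair
  seen[!duplicated(seen[, c("geo_value", "time_value")]), , drop = FALSE]
}

snap <- snapshot_as_of(versioned, "2020-06-09")
snap
```

On 2020-06-09 three versions had been published, and the snapshot keeps only the most recent of them.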

Versioned panel data

Estimated percentage of outpatient visits due to CLI across multiple versions.

Latency and revision in signals

  • Latency: the delay between data collection and availability

Example: A signal based on insurance claims may take several days to appear as claims are processed

  • Revision: data is updated or corrected after initial publication

Example: COVID-19 case reports are revised as reporting backlogs are cleared

Latency and revision in signals - Example

  • Recall the first example of panel & versioned data we’ve seen…
  • On June 1, this signal is 5 days latent: min(version - time_value)
# A tibble: 6 × 5
  time_value geo_value percent_cli version    version_time_diff
  <date>     <chr>           <dbl> <date>     <drtn>           
1 2020-06-01 ca               2.14 2020-06-06 5 days           
2 2020-06-02 ca               1.96 2020-06-06 4 days           
3 2020-06-03 ca               1.77 2020-06-06 3 days           
4 2020-06-04 ca               1.65 2020-06-08 4 days           
5 2020-06-05 ca               1.60 2020-06-09 4 days           
6 2020-06-06 ca               1.34 2020-06-10 4 days           

and subject to revision

# A tibble: 6 × 5
  time_value geo_value percent_cli version    version_time_diff
  <date>     <chr>           <dbl> <date>     <drtn>           
1 2020-06-01 ca               2.14 2020-06-06  5 days          
2 2020-06-01 ca               2.14 2020-06-08  7 days          
3 2020-06-01 ca               2.11 2020-06-09  8 days          
4 2020-06-01 ca               2.13 2020-06-10  9 days          
5 2020-06-01 ca               2.20 2020-06-11 10 days          
6 2020-06-01 ca               2.23 2020-06-12 11 days          
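
Both quantities can be computed directly from the version column. The sketch below uses the six versions of the June 1 value shown above and only base R. The variable names are illustrative.

```r
# Versions of the 2020-06-01 value for CA, copied from the table above
time_value <- as.Date("2020-06-01")
version <- as.Date(c("2020-06-06", "2020-06-08", "2020-06-09",
                     "2020-06-10", "2020-06-11", "2020-06-12"))
percent_cli <- c(2.14, 2.14, 2.11, 2.13, 2.20, 2.23)

# Latency: gap between the event date and the first version that reports it
version_time_diff <- version - time_value
latency <- min(version_time_diff)

# Revision: change between the first and the latest reported value
revision <- percent_cli[which.max(version)] - percent_cli[which.min(version)]

latency
round(revision, 2)
```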

Revision triangle, Outpatient visits in WA 2022

  • 7-day trailing average to smooth day-of-week effects

Revisions

Many data sources are subject to revisions:

  • Case and death counts are frequently corrected or adjusted by authorities

  • Medical claims can take weeks to be submitted and processed

  • Lab tests and medical records can be backlogged

  • Surveys are not completed promptly

An accurate revision log is crucial for researchers building forecasts

Obvious but crucial

A forecast that is made today can only use data we have access to today

Three types of revisions

  1. Sources that don’t revise (provisional and final are the same)

Facebook Survey and Google symptoms

  2. Predictable revisions

Claims data (CHNG) and public health reports aligned by test, hospitalization, or death date

Almost always revised upward as additional claims enter the pipeline

  3. Revisions that are large, erratic, and hard to predict

COVID cases and deaths

These are aligned by report date

Types of revisions - Comparison between 2. and 3.

  • Revision behavior for two indicators in the HRR containing Charlotte, NC.
  • DV-CLI signal (left): regularly revised, but effects fade

  • JHU CSSE cases (right) remain “as first reported” until a major correction is made on Oct. 19

Key takeaways

  • Medical claims revisions: more systematic and predictable

  • COVID-19 case report revisions: erratic and often unpredictable

  • Large spikes or anomalies can occur as

    • Reporting backlogs are cleared
    • Changes in case definitions are implemented

Reporting backlogs - Example

In Bexar County, Texas, during the summer of 2020…

  • Large backlog of case reports results in a spike
  • Auxiliary signals show no such dramatic increase
  • Reports themselves may not be trustworthy without context

Reporting backlogs - Key takeaways



  • Reporting issues common across U.S. jurisdictions



  • Audits regularly discovered misclassified or unreported cases and deaths



  • Cross-check data against external sources from different reporting systems

5 Epidata Repository and API

What is the Epidata repository

Epidata: repository of aggregated epi-surveillance time series

Code is open-source. Signals can be either public or restricted.

  • To date, it has accumulated over 5 billion records.

  • At the peak of the pandemic, it handled millions of API queries per day.

  • Many signals aren’t available elsewhere

Data from: public health reporting, medical insurance claims, medical device data, Google search queries, wastewater, app-based mobility patterns.

Added value: revision tracking, anomaly detection, trend detection, smoothing, imputation, geo-temporal-demographic disaggregation.

Goals of Delphi Epidata platform and repository

  1. Provide many aggregated epi-surveillance time-series (“epi-signals”)
    • Mirror signals from other sources, especially if revisions are not tracked
    • Be the national historical repository of record & preserve the raw data
  2. Be the go-to place for epi-signal discovery, including those held elsewhere

  3. Add value to existing signals and synthesize new ones

    • Via signal fusion, nowcasting, smoothing

Make epi-surveillance more nimble, complete, standardized, robust, and real-time

Features of Delphi Epidata

  • Built-in support for:

    1. Data revisions (“backfill”), including reporting dates and changes
    2. Geo levels w/ auto-aggregation (e.g. county, state, and nation) and specialized levels (e.g., DMA, sewer sheds)
    3. Demographic breakdown
    4. Representation for missingness and censoring
    5. Population sizes and fine-grained population density
  • Pre-computed smoothing and normalization (customization planned)

  • Access control

  • Code is Open Source.

  • Signals are as accessible (w/ API, SDK) as allowed by DUAs

Epidata Documentation


Delphi’s Epidata API: real-time access to epidemiological surveillance data


The main endpoint (covidcast): daily updates about COVID-19 and influenza in the U.S.


A variety of other endpoints: international historical data for COVID-19, influenza, dengue, norovirus

Some of our data sources

Ongoing Sources:

  • Insurance claims: %Covid {inpatient, outpatient}, by county x day
  • Google Symptom searches: 7 symptom groups, by county x day
  • Quidel/Ortho antigen tests: %Covid by age group x county x day
  • NCHS Deaths: all-cause, pneumonia, flu, Covid, by state x week
  • NSSP ED visits: %Covid, %flu, %RSV, by county x week (new!)
  • NWSS Covid: wastewater by sampling-site x day (in progress)

Some of our data sources

Active during pandemic, could be restarted for the next PHE:

  • HHS Hosp/ICU beds: Covid, flu, by {age-group x {state x day, facility x week}}
  • CTIS (“Delphi Facebook Survey”): many dozens of questions, by county x day
  • STLT-reported: {cases, deaths} via {JHU, USAFacts}, by county x day
  • Safegraph mobility: misc measures by {county x day, county x week}

Severity pyramid

6 Find Data Sources & Signals

Finding data sources and signals of interest

Diverse Data Streams

  • Variety of Data: medical claims data, cases and deaths, mobility data
  • Geographic Coverage: includes multiple regions, making it comprehensive yet complex
  • Challenge: difficulty in pinpointing the specific data stream of interest

Using the Documentation

Docs are great for a deep dive into the data, while the apps & tools are useful to see what’s available…

Some tools to explore more easily

Signal discovery app: find available epi-signals in Delphi Epidata and elsewhere in the community


Signal visualization tool


Signal dashboard


“Classic” map-based version: visualize a core set of COVID-19 and flu indicators


Covidcast signal export app


Dashboard builder

7 {epidatr}

Installing {epidatr}

(you already did this, but just for posterity…)

Install the CRAN version

# Install the CRAN version
pak::pkg_install("epidatr")


or the development version

# Install the development version from the GitHub dev branch
pak::pkg_install("cmu-delphi/epidatr@dev")

The CRAN listing is here.

Python

In Python, install delphi-epidata from PyPI with

pip install delphi-epidata


delphi-epidata is soon to be replaced with epidatpy.

# Latest dev version
pip install -e "git+https://github.com/cmu-delphi/epidatpy.git#egg=epidatpy"

# PyPI version (not yet available)
pip install epidatpy

Using {epidatr} and {epidatpy}

library(epidatr)
hhs_flu_nc <- pub_covidcast(
  source = 'hhs', 
  signals = 'confirmed_admissions_influenza_1d', 
  geo_type = 'state', 
  time_type = 'day', 
  geo_values = 'nc',
  time_values = c(20240401, 20240405:20240414)
)
head(hhs_flu_nc, n = 3)
# A tibble: 3 × 15
  geo_value signal     source geo_type time_type time_value direction issue     
  <chr>     <chr>      <chr>  <fct>    <fct>     <date>         <dbl> <date>    
1 nc        confirmed… hhs    state    day       2024-04-01        NA 2024-04-22
2 nc        confirmed… hhs    state    day       2024-04-05        NA 2024-04-22
3 nc        confirmed… hhs    state    day       2024-04-06        NA 2024-04-22
# ℹ 7 more variables: lag <dbl>, missing_value <dbl>, missing_stderr <dbl>,
#   missing_sample_size <dbl>, value <dbl>, stderr <dbl>, sample_size <dbl>


Python equivalent:

from delphi_epidata import Epidata
res = Epidata.covidcast('hhs', 'confirmed_admissions_influenza_1d', 'day', 'state', [20240401, Epidata.range(20240405, 20240414)], 'nc')
print(res['result'], res['message'], len(res['epidata']))

API keys

  • Anyone may access the Epidata API anonymously without providing any personal data!!

  • Anonymous API access is subject to some restrictions: public datasets only; 60 requests per hour; only two parameters may have multiple selections

  • API key grants privileged access; can be obtained by registering with us

  • Privileges of registration: no rate limit; no limit on multiple selections

  • We just want to know which signals people care about and ensure we’re providing benefit

Tip

  • The {epidatr} client automatically searches for the key in the DELPHI_EPIDATA_KEY environment variable.
  • We recommend storing it in your .Renviron file, which R reads by default.
  • More on setting your API key here.
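
A minimal sketch of how to check, from base R, whether a key is visible to the current session. It assumes the key is stored as a line of the form `DELPHI_EPIDATA_KEY=<your key>` in .Renviron, where `<your key>` is a placeholder. The variable `access_mode` is only for illustration.

```r
# Read the key from the environment; an empty string means no key is set
key <- Sys.getenv("DELPHI_EPIDATA_KEY", unset = "")

# Without a key, requests fall under the anonymous-access restrictions above
access_mode <- if (nzchar(key)) "registered" else "anonymous"
access_mode
```

R reads .Renviron at startup, so a session started before the file was edited needs a restart to see the key.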

Interactive tooling in R

Find sources and signals in R?

Functions to enhance data discovery in {epidatr}:

avail_endpoints()
Lists all endpoints with brief descriptions
Highlights endpoints that cover non-US locations
avail_endpoints()
# A tibble: 28 × 2
   Endpoint                          Description                                
   <chr>                             <chr>                                      
 1 pub_covid_hosp_facility()         COVID hospitalizations by facility         
 2 pub_covid_hosp_facility_lookup()  Helper for finding COVID hospitalization f…
 3 pub_covid_hosp_state_timeseries() COVID hospitalizations by state            
 4 pub_covidcast()                   Various COVID and flu signals via the COVI…
 5 pub_covidcast_meta()              Metadata for the COVIDcast endpoint        
 6 pub_delphi()                      Delphi's ILINet outpatient doctor visits f…
 7 pub_dengue_nowcast()              Delphi's PAHO dengue nowcasts (North and S…
 8 pub_ecdc_ili()                    ECDC ILI incidence (Europe)                
 9 pub_flusurv()                     CDC FluSurv flu hospitalizations           
10 pub_fluview()                     CDC FluView ILINet outpatient doctor visits
11 pub_fluview_clinical()            CDC FluView flu tests from clinical labs   
12 pub_fluview_meta()                Metadata for the FluView endpoint          
13 pub_gft()                         Google Flu Trends flu search volume        
14 pub_kcdc_ili()                    KCDC ILI incidence (Korea)                 
15 pub_meta()                        Metadata for the Delphi Epidata API        
16 pub_nidss_dengue()                NIDSS dengue cases (Taiwan)                
17 pub_nidss_flu()                   NIDSS flu doctor visits (Taiwan)           
18 pub_nowcast()                     Delphi's ILI Nearby nowcasts               
19 pub_paho_dengue()                 PAHO dengue data (North and South America) 
20 pub_wiki()                        Wikipedia webpage counts by article        
21 pvt_cdc()                         CDC total and by topic webpage visits      
22 pvt_dengue_sensors()              PAHO dengue digital surveillance sensors (…
23 pvt_ght()                         Google Health Trends health topics search …
24 pvt_meta_norostat()               Metadata for the NoroSTAT endpoint         
25 pvt_norostat()                    CDC NoroSTAT norovirus outbreaks           
26 pvt_quidel()                      Quidel COVID-19 and influenza testing data 
27 pvt_sensors()                     Influenza and dengue digital surveillance …
28 pvt_twitter()                     HealthTweets total and influenza-related t…

Using covidcast_epidata()

covidcast_epidata(): details for signals at the COVIDcast endpoint

  • Assign to an object: cc_ed <- covidcast_epidata()
  • List data sources: cc_ed$sources, with tibbles describing the included signals
  • Editor support: in RStudio or similar editors, use tab completion to explore:
    • cc_ed$sources$ to view available data sources
    • cc_ed$signals$ to see signal options with autocomplete assistance
  • Filtering convenience: signals are prefixed with their source for easier navigation

cc_ed <- covidcast_epidata()
head(cc_ed$sources, n = 2) # head(list, n = 2) will print the first two elements of the list

Fetching data - COVIDcast main endpoint


pub_covidcast() accesses the covidcast endpoint

Need to specify the following arguments…

  1. source: Data source name
  2. signals: Signal name
  3. geo_type: Geographic level
  4. time_type: Time resolution
  5. geo_values: Location(s)
  6. time_values: Time(s) of interest

Fetching data - COVIDcast main endpoint

library(epidatr)
library(dplyr)

jhu_us_cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_7dav_incidence_prop", 
  geo_type = "nation",
  time_type = "day",
  geo_values = "us",
  time_values = epirange(20210101, 20210401)
)
# A tibble: 3 × 8
  geo_value signal             source geo_type time_value issue        lag value
  <chr>     <chr>              <chr>  <fct>    <date>     <date>     <dbl> <dbl>
1 us        confirmed_7dav_in… jhu-c… nation   2021-01-01 2023-03-10   798  61.9
2 us        confirmed_7dav_in… jhu-c… nation   2021-01-02 2023-03-10   797  64.2
3 us        confirmed_7dav_in… jhu-c… nation   2021-01-03 2023-03-10   796  67.1

value is the requested signal

  • the number of daily new confirmed COVID-19 cases per 100,000 population
  • from January to April 2021

Returned data - COVIDcast main endpoint

pub_covidcast() outputs a tibble, where each row represents one observation

Each observation is aggregated by time and by geographic region

  1. time_value: time period when the events occurred.
  2. geo_value: geographic region where the events occurred.
  3. value: estimated value.
  4. stderr: standard error of the estimate, usually referring to the sampling error.
  5. sample_size: number of events used in the estimation.

Returned data - COVIDcast main endpoint

Also reports

  • issue: The time this observation was published

  • lag: The period between when the events occurred and when the observation was published

Tracks the complete revision history of the signal

Allows for historical reconstructions of the information that was available at a specific time

More on this soon!
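
The relationship between the two columns can be checked by hand. The sketch below uses the three rows of the national JHU example shown earlier (issue 2023-03-10) and base R date arithmetic. The variable names are illustrative.

```r
# time_value and issue from the first three rows of the national example
time_value <- as.Date(c("2021-01-01", "2021-01-02", "2021-01-03"))
issue <- as.Date("2023-03-10")

# lag is the number of days between the event date and the publication date
lag <- as.integer(issue - time_value)
lag
```

The result matches the lag column in that output (798, 797, 796).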

Geographic levels

Signals are available at different geographic levels, depending on the endpoint

confirmed_7dav_incidence_prop is also available by state

Change geo_type and geo_values in the previous example

jhu_state_cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_7dav_incidence_prop",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = epirange(20210101, 20210401)
)
# A tibble: 6 × 8
  geo_value signal             source geo_type time_value issue        lag value
  <chr>     <chr>              <chr>  <fct>    <date>     <date>     <dbl> <dbl>
1 ak        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-03   791  35.9
2 al        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-03   791  67.7
3 ar        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-03   791  76.2
4 as        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-03   791   0  
5 az        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-03   791  83.4
6 ca        confirmed_7dav_in… jhu-c… state    2021-01-01 2023-03-10   798 104. 

COVIDcast main endpoint - Example query

County geo_values are FIPS codes. This example uses 06059 (Orange County, California).

jhu_county_cases <- pub_covidcast(
  source = "jhu-csse",
  signals = "confirmed_7dav_incidence_prop",
  geo_type = "county",
  time_type = "day",
  time_values = epirange(20210101, 20210401),
  geo_values = "06059"
)
# A tibble: 6 × 8
  geo_value signal             source geo_type time_value issue        lag value
  <chr>     <chr>              <chr>  <fct>    <date>     <date>     <dbl> <dbl>
1 06059     confirmed_7dav_in… jhu-c… county   2021-01-01 2023-03-03   791  105.
2 06059     confirmed_7dav_in… jhu-c… county   2021-01-02 2023-03-03   790  107.
3 06059     confirmed_7dav_in… jhu-c… county   2021-01-03 2023-03-03   789  108.
4 06059     confirmed_7dav_in… jhu-c… county   2021-01-04 2023-03-03   788  107.
5 06059     confirmed_7dav_in… jhu-c… county   2021-01-05 2023-03-03   787  105.
6 06059     confirmed_7dav_in… jhu-c… county   2021-01-06 2023-03-03   786  104.

The covidcast endpoint supports * in its time and geo fields.

Signal values for all available counties: replace geo_values = "06059" with geo_values = "*".
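
A small sketch of two practical points about county FIPS codes: they are five-character strings whose first two digits identify the state, and they lose their leading zero if read in as numbers. It uses base R only, and the variable names are illustrative.

```r
# Orange County, California, as used in the query above
county_fips <- "06059"

# The first two digits are the state FIPS code (06 is California)
state_fips <- substr(county_fips, 1, 2)

# A code read in as a number loses its leading zero; pad it back to 5 digits
padded <- sprintf("%05d", 6059L)

state_fips
padded
```

Keeping FIPS codes as character vectors from the start avoids the padding step.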

Example queries - Other endpoints: Hospitalizations

COVID-19 Hospitalization: Facility Lookup

API docs: https://cmu-delphi.github.io/delphi-epidata/api/covid_hosp_facility_lookup.html

pub_covid_hosp_facility_lookup(city = "southlake")
# A tibble: 2 × 10
  hospital_pk state ccn    hospital_name    address city  zip   hospital_subtype
  <chr>       <chr> <chr>  <chr>            <chr>   <chr> <chr> <chr>           
1 450888      TX    450888 TEXAS HEALTH HA… 1545 E… SOUT… 76092 Short Term      
2 670132      TX    670132 METHODIST SOUTH… 421 E … SOUT… 76092 Short Term      
# ℹ 2 more variables: fips_code <chr>, is_metro_micro <dbl>
pub_covid_hosp_facility_lookup(state = "WY") |> head()
# A tibble: 6 × 10
  hospital_pk     state ccn   hospital_name address city  zip   hospital_subtype
  <chr>           <chr> <chr> <chr>         <chr>   <chr> <chr> <chr>           
1 100 LANCASTER … WY    2020… 42091         <NA>    [C39… MAIN  390195          
2 2333 BIDDLE AVE WY    2020… 26163         POINT … [C23… HENRY 230146          
3 2333 BIDDLE AV… WY    2020… 26163         POINT … [C23… SELEC 232031          
4 2752 CENTURY B… WY    2020… 42011         POINT … [C39… SURGI 390316          
5 310 SOUTH FALL… WY    2020… 05037         POINT … [C04… CROSS 041307          
6 5200 FAIRVIEW … WY    2020… 27025         POINT … [C24… FAIRV 240050          
# ℹ 2 more variables: fips_code <chr>, is_metro_micro <dbl>
# A non-example (there is no city called New York in Wyoming)
# pub_covid_hosp_facility_lookup(state = "WY", city = "New York")

Example queries - Other endpoints: Hospitalizations

COVID-19 Hospitalization by Facility

API docs: https://cmu-delphi.github.io/delphi-epidata/api/covid_hosp_facility.html

pub_covid_hosp_facility(
  hospital_pks = "100075",
  collection_weeks = epirange(20200101, 20200501)
) |> head()
# A tibble: 6 × 113
  hospital_pk state ccn    hospital_name    address city  zip   hospital_subtype
  <chr>       <chr> <chr>  <chr>            <chr>   <chr> <chr> <chr>           
1 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
2 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
3 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
4 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
5 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
6 100075      FL    100075 ST JOSEPHS HOSP… 3001 W… TAMPA 33677 Short Term      
# ℹ 105 more variables: fips_code <chr>, geocoded_hospital_address <chr>,
#   hhs_ids <chr>, publication_date <date>, collection_week <date>,
#   is_metro_micro <lgl>, total_beds_7_day_sum <dbl>,
#   all_adult_hospital_beds_7_day_sum <dbl>,
#   all_adult_hospital_inpatient_beds_7_day_sum <dbl>,
#   inpatient_beds_used_7_day_sum <dbl>,
#   all_adult_hospital_inpatient_bed_occupied_7_day_sum <dbl>, …

Example queries - Other endpoints: Hospitalizations

COVID-19 Hospitalization by State

API docs: https://cmu-delphi.github.io/delphi-epidata/api/covid_hosp.html

pub_covid_hosp_state_timeseries(states = "MA", dates = "20200510")
# A tibble: 1 × 118
  state geocoded_state issue      date       critical_staffing_shortage_today_…¹
  <chr> <lgl>          <date>     <date>     <lgl>                              
1 MA    NA             2024-05-03 2020-05-10 FALSE                              
# ℹ abbreviated name: ¹​critical_staffing_shortage_today_yes
# ℹ 113 more variables: critical_staffing_shortage_today_no <lgl>,
#   critical_staffing_shortage_today_not_reported <lgl>,
#   critical_staffing_shortage_anticipated_within_week_yes <lgl>,
#   critical_staffing_shortage_anticipated_within_week_no <lgl>,
#   critical_staffing_shortage_anticipated_within_week_not_reported <lgl>,
#   hospital_onset_covid <dbl>, hospital_onset_covid_coverage <dbl>, …

Example queries - Other endpoints: Flu endpoints

FluSurv hospitalization data – Data ends around 2020

API docs: https://cmu-delphi.github.io/delphi-epidata/api/flusurv.html

pub_flusurv(locations = "ca", epiweeks = 202001) 

Fluview data – Remains active

API docs: https://cmu-delphi.github.io/delphi-epidata/api/fluview.html

pub_fluview(regions = "nat", epiweeks = epirange(201201, 202001))
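
The epiweeks arguments above are written as YYYYWW: a four-digit year followed by a two-digit epidemiological week. A minimal sketch of splitting such a value in base R, with illustrative variable names:

```r
# 202001 means epidemiological week 1 of 2020
epiweek <- 202001L

epi_year <- epiweek %/% 100L  # integer division keeps the year
epi_week <- epiweek %% 100L   # remainder keeps the week number

sprintf("year %d, week %d", epi_year, epi_week)
```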

Public vs private endpoints

Public endpoints are accessed with functions starting with pub_

Private endpoints are accessed with functions starting with pvt_ and require an authorized API key

Store the key in your .Renviron file, or set it as an environment variable

Examples

Signal metadata

Some endpoints provide additional metadata

  • Time Information: available time frames and most recent update
  • Geography Information: available geographies

Metadata accessors

  • pub_covidcast_meta(): metadata for COVIDcast
  • pub_fluview_meta(): metadata for FluView
  • pub_meta(): general metadata for the Delphi Epidata API

8 Versioning in {epidatr}

Versioned data in {epidatr}

The Epidata API records each signal’s estimate, location, date, and update timeline

Requesting Specific Data Versions:

  • Use as_of or issues to specify data availability
  • as_of always fetches one version
  • issues can fetch multiple
  • Only one may be used at a time
  • Not all endpoints support both

Obtaining data “as of” a specific date

Doctor Visits (from the covidcast endpoint)

  • The percentage of outpatient visits w/ Covid-like illness
  • Pennsylvania on May 1, 2020:
dv_pa_as_of <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = "2020-05-01",
  geo_type = "state",
  geo_values = "pa",
  as_of = "2020-05-07"
)
# A tibble: 1 × 7
  geo_value signal           source        time_value issue        lag value
  <chr>     <chr>            <chr>         <date>     <date>     <dbl> <dbl>
1 pa        smoothed_adj_cli doctor-visits 2020-05-01 2020-05-07     6  2.58
  • The initial estimate was issued on May 7, 2020
  • The 6-day lag is due to delays in reporting and ingestion by the API

Obtaining data “as of” a specific date

Default behaviour: if as_of is unspecified, you get the most recent data

dv_pa_final <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = "2020-05-01",
  geo_type = "state",
  geo_values = "pa"
)
# A tibble: 1 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  5.97     NA

Estimate changed substantially:

  • Increased to ~6% from <3%
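
As a quick worked check of the size of this revision, the two values printed above can be compared directly. The variable names are illustrative only.

```r
initial <- 2.58  # value in the first issue (2020-05-07)
final   <- 5.97  # value in the most recent issue (2020-07-04)

revision_points <- final - initial  # change in percentage points
revision_ratio  <- final / initial  # multiplicative change
```

The estimate rose by 3.39 percentage points, so the latest value is more than twice the first one.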

Versioning is important for forecasting

  • Backtesting requires using data that would have been available at the time
  • Using later updates instead makes the results overly optimistic
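
A small offline sketch of why this matters, again in base R. The version table is a subset of the issues printed in these slides for Pennsylvania (time_value 2020-05-01), and the forecast dates are hypothetical choices made for illustration. For each forecast date it looks up the value a forecaster would actually have seen and compares it with the latest value.

```r
# Subset of the issues shown in the slides, plus the most recent issue
versions <- data.frame(
  issue = as.Date(c("2020-05-07", "2020-05-08", "2020-05-09",
                    "2020-05-12", "2020-05-13", "2020-05-14", "2020-07-04")),
  value = c(2.58, 3.28, 3.32, 3.59, 3.63, 3.66, 5.97)
)

# Hypothetical forecast dates for the backtest
forecast_dates <- as.Date(c("2020-05-08", "2020-05-12", "2020-05-14"))

# Value in the most recent issue: what a naive backtest would use
latest_value <- versions$value[which.max(as.numeric(versions$issue))]

# Value that was actually available on each forecast date
available_value <- vapply(seq_along(forecast_dates), function(i) {
  known <- versions[versions$issue <= forecast_dates[i], , drop = FALSE]
  known$value[which.max(as.numeric(known$issue))]
}, numeric(1))

backtest <- data.frame(
  forecast_date = forecast_dates,
  available_value = available_value,
  latest_value = latest_value,
  gap = latest_value - available_value
)
```

Every row of `backtest` has a gap above 2 percentage points: a model evaluated on the latest values would be fed inputs that no forecaster had on those dates.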

Obtaining multiple specific issues for one state

Request all issues in a certain time period

dv_pa_issues <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = "2020-05-01",
  geo_type = "state",
  geo_values = "pa",
  issues = epirange("2020-05-01", "2020-05-15")
)
# A tibble: 6 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-07     6  2.58     NA
2 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-08     7  3.28     NA
3 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-09     8  3.32     NA
4 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-12    11  3.59     NA
5 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-13    12  3.63     NA
6 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-14    13  3.66     NA

Obtaining multiple issues for one state

To get all issues up to a specific date, set an extreme lower bound

dv_pa_issues_sub <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = "2020-05-01",
  geo_type = "state",
  geo_values = "pa",
  issues = epirange("1900-01-01", "2020-05-15")
)
# A tibble: 6 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-07     6  2.58     NA
2 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-08     7  3.28     NA
3 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-09     8  3.32     NA
4 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-12    11  3.59     NA
5 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-13    12  3.63     NA
6 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-14    13  3.66     NA

  • No change here
  • Can matter if the latency or reporting lag is unknown

API docs show the earliest date available.

Obtaining multiple issues for one state

  • At some point, nothing changes
  • It is finalized
  • That will be the “last” issue

dv_pa_issues_all <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = "2020-05-01",
  geo_type = "state",
  geo_values = "pa",
  issues = epirange("1900-01-01", "2024-12-11") # From the 1900s to today
)
# A tibble: 6 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-06-29    59  5.99     NA
2 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-06-30    60  5.99     NA
3 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-01    61  5.95     NA
4 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-02    62  5.97     NA
5 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-03    63  5.97     NA
6 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  5.97     NA
  • Avoid queries with a too-late minimum or a too-early maximum issue date
  • The results could be misleading
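
The following offline sketch illustrates how a badly chosen issue window can mislead. It reuses the six May 2020 issues printed earlier for Pennsylvania; `first_in_range()` and `last_in_range()` are hypothetical helpers, not {epidatr} functions.

```r
versions <- data.frame(
  issue = as.Date(c("2020-05-07", "2020-05-08", "2020-05-09",
                    "2020-05-12", "2020-05-13", "2020-05-14")),
  value = c(2.58, 3.28, 3.32, 3.59, 3.63, 3.66)
)

# Keep only the versions issued within [from, to]
in_range <- function(versions, from, to) {
  keep <- versions$issue >= as.Date(from) & versions$issue <= as.Date(to)
  versions[keep, , drop = FALSE]
}

# Hypothetical helpers: earliest and latest version inside the window
first_in_range <- function(versions, from, to) {
  x <- in_range(versions, from, to)
  x[which.min(as.numeric(x$issue)), , drop = FALSE]
}
last_in_range <- function(versions, from, to) {
  x <- in_range(versions, from, to)
  x[which.max(as.numeric(x$issue)), , drop = FALSE]
}

# Extreme lower bound: the true first release (2020-05-07, 2.58)
true_first <- first_in_range(versions, "1900-01-01", "2020-05-15")

# Too-late minimum: the "first" issue appears to be 2020-05-12 (3.59)
apparent_first <- first_in_range(versions, "2020-05-10", "2020-05-15")

# Too-early maximum: the "last" issue appears to be 2020-05-09 (3.32)
apparent_last <- last_in_range(versions, "1900-01-01", "2020-05-10")
```

With the narrower windows, both the apparent initial release and the apparent latest value differ from what the full record shows.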

Obtaining all issues for one state

dv_pa_issues_star <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-07"),
  geo_type = "state",
  geo_values = "pa",
  issues = "*"
)
# A tibble: 8 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-07     6  2.58     NA
2 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-08     7  3.28     NA
3 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-09     8  3.32     NA
4 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-12    11  3.59     NA
5 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-13    12  3.63     NA
6 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-14    13  3.66     NA
7 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-15    14  3.66     NA
8 pa        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-16    15  3.61     NA

Obtaining all issues for all states

Using "*" gives all available values

dv_state_issues_star <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-07"),
  geo_type = "state",
  geo_values = "*",
  issues = "*"
)
# A tibble: 6 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-07     6  1.61     NA
2 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-08     7  2.40     NA
3 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-09     8  2.38     NA
4 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-12    11  2.38     NA
5 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-13    12  2.36     NA
6 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-05-14    13  2.36     NA
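
Once several issues for several states are in one table, a per-state revision summary is a natural next step. Below is a minimal base R sketch that runs offline on the Alaska and Pennsylvania rows printed in these slides (time_value 2020-05-01 only). The column names of the summary are illustrative choices. With the real query result, the same grouping would also need to include `time_value`.

```r
issue_dates <- as.Date(c("2020-05-07", "2020-05-08", "2020-05-09",
                         "2020-05-12", "2020-05-13", "2020-05-14"))

# Rows copied from the slide output for two states
versions <- data.frame(
  geo_value = rep(c("ak", "pa"), each = length(issue_dates)),
  issue = rep(issue_dates, times = 2),
  value = c(1.61, 2.40, 2.38, 2.38, 2.36, 2.36,   # ak
            2.58, 3.28, 3.32, 3.59, 3.63, 3.66)   # pa
)

# Sort so the first and last rows per state are the first and last issues
versions <- versions[order(versions$geo_value, versions$issue), ]

# One summary row per state
by_state <- split(versions, versions$geo_value)
revision_summary <- do.call(rbind, lapply(by_state, function(df) {
  data.frame(
    geo_value = df$geo_value[1],
    first_issue = df$issue[1],
    first_value = df$value[1],
    last_value = df$value[nrow(df)],
    n_issues = nrow(df)
  )
}))
```

In this window both states were first issued on 2020-05-07 and both were revised upward over the following week.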

Obtaining one issue for all states

Defaults are intended to be “what you would expect”

dv_state_default <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-07"),
  geo_type = "state"
)
# A tibble: 6 × 8
  geo_value signal           source     time_value issue        lag value stderr
  <chr>     <chr>            <chr>      <date>     <date>     <dbl> <dbl>  <dbl>
1 ak        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  5.72     NA
2 al        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  2.74     NA
3 ar        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  4.23     NA
4 az        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  2.78     NA
5 ca        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  4.25     NA
6 co        smoothed_adj_cli doctor-vi… 2020-05-01 2020-07-04    64  8.77     NA

  • most recent issue
  • all states

Main takeaways

  • Delphi Epidata: platform for real-time epidemic data
    • provides (aggregated) signals for tracking and forecasting
    • sources like health records, mobility patterns, and more.
  • Epidata API: delivers up-to-date, granular epidemiological data + historical versions.
  • {epidatr}: Client package for R
  • Versioned Data and Latency:
    1. as_of: One version; the data as it was available on a specific date
    2. issues: Multiple versions; each published on a different issue date

Versioning manages the record of revisions for transparency and accuracy in data analysis.