The epi_df
data structure provided by
epiprocess
provides convenient ways to perform common
processing tasks. In this vignette, we will:
- construct an
epi_df
from a data frame - perform rolling time-window computations using
epi_slide()
- perform group-level aggregation using
sum_groups_epi_df()
- detect and fill time gaps using
complete.epi_df()
and tsibble - perform geographic aggregation (not yet implemented)
Getting data into epi_df
format
As in vignette("epiprocess")
, we will fetch daily
reported COVID-19 cases from CA, FL, NY, and TX (note: here we’re using
new, not cumulative cases) using the epidatr
package, and then convert this to epi_df
format.
The data is included in this package (via the epidatasets
package) and can be loaded with:
edf <- cases_deaths_subset %>%
select(geo_value, time_value, cases) %>%
arrange(geo_value, time_value)
The data can also be fetched from the Delphi Epidata API with the following query:
library(epidatr)
d <- as.Date("2024-03-20")
edf <- pub_covidcast(
source = "jhu-csse",
signals = "confirmed_incidence_num",
geo_type = "state",
time_type = "day",
geo_values = "ca,fl,ny,tx,ga,pa",
time_values = epirange(20200301, 20211231),
as_of = d
) %>%
select(geo_value, time_value, cases = value) %>%
arrange(geo_value, time_value) %>%
as_epi_df(as_of = d)
The data has 2,684 rows and 3 columns.
Rolling computations using epi_slide
A very common operation in time series processing is aggregating the
values of the time series by applying some function on a rolling time
window of data points. The key tool that allows this is
epi_slide()
. The function always first makes sure to group
the data by the grouping variables of the epi_df
object,
which includes the geo_value
and possibly
other_keys
columns. It then applies the rolling slide
computation inside each group.
The epi_slide()
function has three ways to specify the
computation to be performed:
- by using a tidy evaluation approach
- by passing a formula
- by passing a function
Slide the tidy way
Usually, the most convenient way to setup a computation in
epi_slide()
is to pass in an expression for tidy
evaluation. In this case, we can simply define the name of the new
column directly as part of the expression, setting it equal to a
computation in which we can access any columns of .x
by
name, just as we would in a call to, say, dplyr::mutate()
.
For example:
As a simple sanity check, we visualize the 7-day trailing averages computed on top of the original counts:
library(ggplot2)
ggplot(slide_output, aes(x = time_value)) +
geom_col(aes(y = cases, fill = geo_value), alpha = 0.5, show.legend = FALSE) +
geom_line(aes(y = cases_7sd, col = geo_value), show.legend = FALSE) +
facet_wrap(~geo_value, scales = "free_y") +
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
labs(x = "Date", y = "Reported COVID-19 cases")
As we can see from the Texas plot, the state moved to weekly reporting of COVID-19 cases in summer of 2021.
Note that without epi_slide()
, the computation is much
less convenient. For instance, a rough equivalent of the above
computation would be the following, which is easy to get wrong:
edf %>%
complete(geo_value, time_value = seq.Date(min(time_value), max(time_value), by = "day")) %>%
arrange_canonical() %>%
group_by(geo_value) %>%
mutate(cases_7sd = slider::slide_dbl(cases, .f = sd, na.rm = TRUE, .before = 7, .after = 0))
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> # Groups: geo_value [6]
#> geo_value time_value cases cases_7sd
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA
#> 2 ca 2020-03-02 4 1.41
#> 3 ca 2020-03-03 6 1.15
#> 4 ca 2020-03-04 11 2.99
#> 5 ca 2020-03-05 10 2.97
#> 6 ca 2020-03-06 18 5.08
#> # ℹ 4,020 more rows
Furthermore epi_slide()
allows for selecting
.ref_time_value
, which the latter recipe does not
support.
Slide with a function
We can also pass a function to the second argument in
epi_slide()
. In this case, the passed function
.f
must have the form function(x, g, t, ...)
,
where
-
x
is an epi_df with the same column names as the input epi_df -
g
is a one-row tibble containing the values of the grouping variables for the associated group, for instanceg$geo_value
-
t
is the ref_time_value for the current window -
...
are additional arguments
The same computation as above can be done with a function:
edf %>%
epi_slide(.f = function(x, g, t) sd(x$cases, na.rm = TRUE), .window_size = 7)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> geo_value time_value cases slide_value
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA
#> 2 ca 2020-03-02 4 1.41
#> 3 ca 2020-03-03 6 1.15
#> 4 ca 2020-03-04 11 2.99
#> 5 ca 2020-03-05 10 2.97
#> 6 ca 2020-03-06 18 5.08
#> # ℹ 4,020 more rows
epi_slide()
with a formula
The same computation as above can be done with a formula, where all
references to the columns must be made with the prefix
.x$...
, for instance:
edf %>%
epi_slide(~ sd(.x$cases, na.rm = TRUE), .window_size = 7)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> geo_value time_value cases slide_value
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2020-03-01 6 NA
#> 2 ca 2020-03-02 4 1.41
#> 3 ca 2020-03-03 6 1.15
#> 4 ca 2020-03-04 11 2.99
#> 5 ca 2020-03-05 10 2.97
#> 6 ca 2020-03-06 18 5.08
#> # ℹ 4,020 more rows
Note that the name of the column defaults to slide_value
in the unnamed formula or function case. This can be adjusted with
.new_col_name
.
Rolling computations with multiple column outputs
If your formula (or function) returns a data.frame, then the columns
of the data.frame will be unpacked into the resulting
epi_df
(in the sense of tidyr::unpack()
). For
example, the following computes the 7-day trailing average of daily
cases as well as the the 7-day trailing standard deviation of daily
cases:
edf %>%
epi_slide(
~ data.frame(cases_mean = mean(.x$cases, na.rm = TRUE), cases_sd = sd(.x$cases, na.rm = TRUE)),
.window_size = 7
)
#> An `epi_df` object, 4,026 x 5 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 5
#> geo_value time_value cases cases_mean cases_sd
#> * <chr> <date> <dbl> <dbl> <dbl>
#> 1 ca 2020-03-01 6 6 NA
#> 2 ca 2020-03-02 4 5 1.41
#> 3 ca 2020-03-03 6 5.33 1.15
#> 4 ca 2020-03-04 11 6.75 2.99
#> 5 ca 2020-03-05 10 7.4 2.97
#> 6 ca 2020-03-06 18 9.17 5.08
#> # ℹ 4,020 more rows
Optimized rolling mean and sums
For the two most common sliding operations, we offer two optimized
versions: epi_slide_mean()
and
epi_slide_sum()
. These are much faster than
epi_slide()
, so we recommend using them when you are only
interested in the mean or sum of a column. The following computes the
7-day trailing mean of daily cases:
edf %>%
group_by(geo_value) %>%
epi_slide_mean("cases", .window_size = 7, na.rm = TRUE)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> # Groups: geo_value [6]
#> geo_value time_value cases cases_7dav
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2020-03-01 6 6
#> 2 ca 2020-03-02 4 5
#> 3 ca 2020-03-03 6 5.33
#> 4 ca 2020-03-04 11 6.75
#> 5 ca 2020-03-05 10 7.4
#> 6 ca 2020-03-06 18 9.17
#> # ℹ 4,020 more rows
edf %>%
group_by(geo_value) %>%
epi_slide_sum("cases", .window_size = 7, na.rm = TRUE)
#> An `epi_df` object, 4,026 x 4 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 4,026 × 4
#> # Groups: geo_value [6]
#> geo_value time_value cases cases_7dsum
#> * <chr> <date> <dbl> <dbl>
#> 1 ca 2020-03-01 6 6
#> 2 ca 2020-03-02 4 10
#> 3 ca 2020-03-03 6 16
#> 4 ca 2020-03-04 11 27
#> 5 ca 2020-03-05 10 37
#> 6 ca 2020-03-06 18 55
#> # ℹ 4,020 more rows
Running a forecaster on a sliding window of data
The natural next step is to use the sliding window to forecast future
values. However to do this correctly, we should make sure that our data
is historically accurate. The data structure we use for that is the
epi_archive
and the analogous slide function is
epix_slide()
. To read further along this train of thought,
see vignette("epi_archive")
.
Adding more keys to an epi_df
and aggregating groups
with sum_groups_epi_df
An epi_df
object can have more key columns than just
geo_value
and time_value
. For example, if we
have demographic attributes like age group, we can add this as a key
column. We can then aggregate the data by these key columns using
sum_groups_epi_df()
. Let’s use influenza hospitalization
rate data from the CDC system FluSurv as an example. We can get it from
the Delphi
Epidata API
library(epidatr)
flu_data <- pub_flusurv(
locations = "ca",
epiweeks = epirange(201801, 202001),
) %>%
select(location, epiweek, issue, rate_age_0, rate_age_1, rate_age_2, rate_age_3, rate_age_4) %>%
tidyr::pivot_longer(cols = starts_with("rate_age_"), names_to = "age_group", values_to = "rate")
flu_data
#> # A tibble: 305 × 5
#> location epiweek issue age_group rate
#> <chr> <date> <date> <chr> <dbl>
#> 1 CA 2017-12-31 2018-12-30 rate_age_0 4.4
#> 2 CA 2017-12-31 2018-12-30 rate_age_1 1.7
#> 3 CA 2017-12-31 2018-12-30 rate_age_2 2.7
#> 4 CA 2017-12-31 2018-12-30 rate_age_3 18.7
#> 5 CA 2017-12-31 2018-12-30 rate_age_4 99.7
#> 6 CA 2018-01-07 2019-01-06 rate_age_0 4.4
#> # ℹ 299 more rows
We can now convert this data to an epi_df
object and set
the age_group
column as an additional group key:
flu_data <- flu_data %>% as_epi_df(other_keys = "age_group", as_of = as.Date("2024-03-20"))
#> inferring time_value column.
#> inferring geo_value column.
flu_data
#> An `epi_df` object, 305 x 5 with metadata:
#> * geo_type = state
#> * time_type = week
#> * other_keys = age_group
#> * as_of = 2024-03-20
#>
#> # A tibble: 305 × 5
#> geo_value age_group time_value issue rate
#> * <chr> <chr> <date> <date> <dbl>
#> 1 CA rate_age_0 2017-12-31 2018-12-30 4.4
#> 2 CA rate_age_1 2017-12-31 2018-12-30 1.7
#> 3 CA rate_age_2 2017-12-31 2018-12-30 2.7
#> 4 CA rate_age_3 2017-12-31 2018-12-30 18.7
#> 5 CA rate_age_4 2017-12-31 2018-12-30 99.7
#> 6 CA rate_age_0 2018-01-07 2019-01-06 4.4
#> # ℹ 299 more rows
Note that the epi_df
object now has an additional key
column age_group
. This means that there should only be one
row for each combination of geo_value
,
time_value
, and age_group
in the dataset (this
is enforced at construction time).
Now we can aggregate the data by age_group
, if we want
to compute the total:
group_cols <- key_colnames(exclude = "age_group")
flu_data %>%
sum_groups_epi_df("rate", group_cols = group_cols)
#> An `epi_df` object, 61 x 3 with metadata:
#> * geo_type = custom
#> * time_type = week
#> * as_of = 2024-03-20
#>
#> # A tibble: 61 × 3
#> geo_value time_value rate
#> * <chr> <date> <dbl>
#> 1 total 2017-12-31 127.
#> 2 total 2018-01-07 76.7
#> 3 total 2018-01-14 57.8
#> 4 total 2018-01-21 37.9
#> 5 total 2018-01-28 30
#> 6 total 2018-02-04 22.8
#> # ℹ 55 more rows
Detecting and filling time gaps with
complete.epi_df
Sometimes you may have missing data in your time series. This can be
due to actual missing data, or it can be due to the fact that the data
is only reported on certain days. In the latter case, it is often useful
to fill in the missing data with explicit zeros. This can be done with
the complete.epi_df()
function.
First, let’s create a data set with some missing data. We will reuse
the dataset edf
from above, but modify it slightly.
edf_missing <- edf %>%
filter(geo_value %in% c("ca", "tx")) %>%
group_by(geo_value) %>%
slice(1:3, 5:6)
edf_missing %>%
print(n = 10)
#> An `epi_df` object, 10 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 10 × 3
#> # Groups: geo_value [2]
#> geo_value time_value cases
#> * <chr> <date> <dbl>
#> 1 ca 2020-03-01 6
#> 2 ca 2020-03-02 4
#> 3 ca 2020-03-03 6
#> 4 ca 2020-03-05 10
#> 5 ca 2020-03-06 18
#> 6 tx 2020-03-01 0
#> 7 tx 2020-03-02 0
#> 8 tx 2020-03-03 0
#> 9 tx 2020-03-05 3
#> 10 tx 2020-03-06 1
Now let’s fill in the missing data with explicit zeros:
edf_missing %>%
complete(
time_value = seq.Date(min(time_value), max(time_value), by = "day"),
fill = list(cases = 0)
) %>%
print(n = 12)
#> An `epi_df` object, 12 x 3 with metadata:
#> * geo_type = state
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 12 × 3
#> # Groups: geo_value [2]
#> geo_value time_value cases
#> * <chr> <date> <dbl>
#> 1 ca 2020-03-01 6
#> 2 ca 2020-03-02 4
#> 3 ca 2020-03-03 6
#> 4 ca 2020-03-04 0
#> 5 ca 2020-03-05 10
#> 6 ca 2020-03-06 18
#> 7 tx 2020-03-01 0
#> 8 tx 2020-03-02 0
#> 9 tx 2020-03-03 0
#> 10 tx 2020-03-04 0
#> 11 tx 2020-03-05 3
#> 12 tx 2020-03-06 1
Detecting and filling time gaps with tsibble
We can also use the tsibble
package to detect and fill
time gaps. We’ll work with county-level reported COVID-19 cases in MA
and VT.
The data is included in this package (via the epidatasets
package) and can be loaded with:
The data can also be fetched from the Delphi Epidata API with the following query:
library(epidatr)
d <- as.Date("2024-03-20")
# Get mapping between FIPS codes and county&state names:
y <- read_csv("https://github.com/cmu-delphi/covidcast/raw/c89e4d295550ba1540d64d2cc991badf63ad04e5/Python-packages/covidcast-py/covidcast/geo_mappings/county_census.csv", # nolint: line_length_linter
col_types = c(
FIPS = col_character(),
CTYNAME = col_character(),
STNAME = col_character()
)
) %>%
filter(STNAME %in% c("Massachusetts", "Vermont"), STNAME != CTYNAME) %>%
select(geo_value = FIPS, county_name = CTYNAME, state_name = STNAME)
# Fetch only counties from Massachusetts and Vermont, then append names columns as well
x <- pub_covidcast(
source = "jhu-csse",
signals = "confirmed_incidence_num",
geo_type = "county",
time_type = "day",
geo_values = paste(y$geo_value, collapse = ","),
time_values = epirange(20200601, 20211231),
as_of = d
) %>%
select(geo_value, time_value, cases = value) %>%
inner_join(y, by = "geo_value", relationship = "many-to-one", unmatched = c("error", "drop")) %>%
as_epi_df(as_of = d)
The data contains 16,212 rows and 5 columns.
Converting to tsibble
format
For manipulating and wrangling time series data, the tsibble
already provides a host of useful tools. A tsibble object (formerly, of
class tbl_ts
) is basically a tibble (data frame) but with
two specially-marked columns: an index column
representing the time variable (defining an order from past to present),
and a key column identifying a unique observational
unit for each time point. In fact, the key can be made up of any number
of columns, not just a single one.
In an epi_df
object, the index variable is
time_value
, and the key variable is typically
geo_value
(though this need not always be the case: for
example, if we have an age group variable as another column, then this
could serve as a second key variable). The epiprocess
package thus provides an implementation of as_tsibble()
for
epi_df
objects, which sets these variables according to
these defaults.
library(tsibble)
xt <- as_tsibble(x)
head(xt)
#> # A tsibble: 6 x 5 [1D]
#> # Key: geo_value [1]
#> geo_value time_value cases county_name state_name
#> <chr> <date> <dbl> <chr> <chr>
#> 1 25001 2020-06-01 4 Barnstable County Massachusetts
#> 2 25001 2020-06-02 2 Barnstable County Massachusetts
#> 3 25001 2020-06-03 6 Barnstable County Massachusetts
#> 4 25001 2020-06-04 4 Barnstable County Massachusetts
#> 5 25001 2020-06-05 2 Barnstable County Massachusetts
#> 6 25001 2020-06-06 2 Barnstable County Massachusetts
key(xt)
#> [[1]]
#> geo_value
index(xt)
#> time_value
interval(xt)
#> <interval[1]>
#> [1] 1D
We can also set the key variable(s) directly in a call to
as_tsibble()
. Similar to SQL keys, if the key does not
uniquely identify each time point (that is, the key and index together
do not not uniquely identify each row), then as_tsibble()
throws an error:
head(as_tsibble(x, key = "county_name"))
#> Error in `validate_tsibble()`:
#> ! A valid tsibble must have distinct rows identified by key and index.
#> ℹ Please use `duplicates()` to check the duplicated rows.
As we can see, there are duplicate county names between Massachusetts and Vermont, which caused the error.
head(duplicates(x, key = "county_name"))
#> # A tibble: 6 × 5
#> geo_value time_value cases county_name state_name
#> <chr> <date> <dbl> <chr> <chr>
#> 1 25009 2020-06-01 92 Essex County Massachusetts
#> 2 25011 2020-06-01 0 Franklin County Massachusetts
#> 3 50009 2020-06-01 0 Essex County Vermont
#> 4 50011 2020-06-01 0 Franklin County Vermont
#> 5 25009 2020-06-02 90 Essex County Massachusetts
#> 6 25011 2020-06-02 0 Franklin County Massachusetts
Keying by both county name and state name, however, does work:
head(as_tsibble(x, key = c("county_name", "state_name")))
#> # A tsibble: 6 x 5 [1D]
#> # Key: county_name, state_name [1]
#> geo_value time_value cases county_name state_name
#> <chr> <date> <dbl> <chr> <chr>
#> 1 50001 2020-06-01 0 Addison County Vermont
#> 2 50001 2020-06-02 0 Addison County Vermont
#> 3 50001 2020-06-03 0 Addison County Vermont
#> 4 50001 2020-06-04 0 Addison County Vermont
#> 5 50001 2020-06-05 0 Addison County Vermont
#> 6 50001 2020-06-06 1 Addison County Vermont
One of the major advantages of the tsibble
package is
its ability to handle implicit gaps in time series
data. In other words, it can infer what time scale we’re interested in
(say, daily data), and detect apparent gaps (say, when values are
reported on January 1 and 3 but not January 2). We can subsequently use
functionality to make these missing entries explicit, which will
generally help avoid bugs in further downstream data processing
tasks.
Let’s first remove certain dates from our data set to create gaps:
state_naming <- read_csv("https://github.com/cmu-delphi/covidcast/raw/c89e4d295550ba1540d64d2cc991badf63ad04e5/Python-packages/covidcast-py/covidcast/geo_mappings/state_census.csv", # nolint: line_length_linter
col_types = c(NAME = col_character(), ABBR = col_character())
) %>%
transmute(state_name = NAME, abbr = tolower(ABBR)) %>%
as_tibble()
# First make geo value more readable for tables, plots, etc.
x <- x %>%
inner_join(state_naming, by = "state_name", relationship = "many-to-one", unmatched = c("error", "drop")) %>%
mutate(geo_value = paste(substr(county_name, 1, nchar(county_name) - 7), state_name, sep = ", ")) %>%
select(geo_value, time_value, cases)
xt <- as_tsibble(x) %>% filter(cases >= 3)
The functions has_gaps()
, scan_gaps()
,
count_gaps()
in the tsibble
package each
provide useful summaries, in slightly different formats.
head(has_gaps(xt))
#> # A tibble: 6 × 2
#> geo_value .gaps
#> <chr> <lgl>
#> 1 Addison, Vermont TRUE
#> 2 Barnstable, Massachusetts TRUE
#> 3 Bennington, Vermont TRUE
#> 4 Berkshire, Massachusetts TRUE
#> 5 Bristol, Massachusetts TRUE
#> 6 Caledonia, Vermont TRUE
head(scan_gaps(xt))
#> # A tsibble: 6 x 2 [1D]
#> # Key: geo_value [1]
#> geo_value time_value
#> <chr> <date>
#> 1 Addison, Vermont 2020-08-28
#> 2 Addison, Vermont 2020-08-29
#> 3 Addison, Vermont 2020-08-30
#> 4 Addison, Vermont 2020-08-31
#> 5 Addison, Vermont 2020-09-01
#> 6 Addison, Vermont 2020-09-02
head(count_gaps(xt))
#> # A tibble: 6 × 4
#> geo_value .from .to .n
#> <chr> <date> <date> <int>
#> 1 Addison, Vermont 2020-08-28 2020-10-04 38
#> 2 Addison, Vermont 2020-10-06 2020-10-23 18
#> 3 Addison, Vermont 2020-10-25 2020-11-04 11
#> 4 Addison, Vermont 2020-11-06 2020-11-10 5
#> 5 Addison, Vermont 2020-11-14 2020-11-18 5
#> 6 Addison, Vermont 2020-11-20 2020-11-20 1
We can also visualize the patterns of missingness:
library(ggplot2)
ggplot(
count_gaps(xt),
aes(
x = reorder(geo_value, desc(geo_value)),
color = geo_value
)
) +
geom_linerange(aes(ymin = .from, ymax = .to)) +
geom_point(aes(y = .from)) +
geom_point(aes(y = .to)) +
coord_flip() +
labs(x = "County", y = "Date") +
theme(legend.position = "none")
Using the fill_gaps()
function from
tsibble
, we can replace all gaps by an explicit value. The
default is NA
, but in the current case, where missingness
is not at random but rather represents a small value that was censored
(only a hypothetical with COVID-19 reports, but certainly a real
phenomenon that occurs in other signals), it is better to replace it by
zero, which is what we do here. (Other approaches, such as LOCF: last
observation carried forward in time, could be accomplished by first
filling with NA
values and then following up with a second
call to tidyr::fill()
.)
fill_gaps(xt, cases = 0) %>%
head()
#> # A tsibble: 6 x 3 [1D]
#> # Key: geo_value [1]
#> geo_value time_value cases
#> <chr> <date> <dbl>
#> 1 Addison, Vermont 2020-08-27 3
#> 2 Addison, Vermont 2020-08-28 0
#> 3 Addison, Vermont 2020-08-29 0
#> 4 Addison, Vermont 2020-08-30 0
#> 5 Addison, Vermont 2020-08-31 0
#> 6 Addison, Vermont 2020-09-01 0
Note that the time series for Addison, VT only starts on August 27,
2020, even though the original (uncensored) data set itself was drawn
from a period that went back to June 6, 2020. By setting
.full = TRUE
, we can at zero-fill over the entire span of
the observed (censored) data.
xt_filled <- fill_gaps(xt, cases = 0, .full = TRUE)
head(xt_filled)
#> # A tsibble: 6 x 3 [1D]
#> # Key: geo_value [1]
#> geo_value time_value cases
#> <chr> <date> <dbl>
#> 1 Addison, Vermont 2020-06-01 0
#> 2 Addison, Vermont 2020-06-02 0
#> 3 Addison, Vermont 2020-06-03 0
#> 4 Addison, Vermont 2020-06-04 0
#> 5 Addison, Vermont 2020-06-05 0
#> 6 Addison, Vermont 2020-06-06 0
Explicit imputation for missingness (zero-filling in our case) can be
important for protecting against bugs in all sorts of downstream tasks.
For example, even something as simple as a 7-day trailing average is
complicated by missingness. The function epi_slide()
looks
for all rows within a window of 7 days anchored on the right at the
reference time point (when .window_size = 7
). But when some
days in a given week are missing because they were censored because they
had small case counts, taking an average of the observed case counts can
be misleading and is unintentionally biased upwards. Meanwhile, running
epi_slide()
on the zero-filled data brings these trailing
averages (appropriately) downwards, as we can see inspecting Plymouth,
MA around July 1, 2021.
xt %>%
as_epi_df(as_of = as.Date("2024-03-20")) %>%
group_by(geo_value) %>%
epi_slide(cases_7dav = mean(cases), .window_size = 7) %>%
ungroup() %>%
filter(
geo_value == "Plymouth, MA",
abs(time_value - as.Date("2021-07-01")) <= 3
) %>%
print(n = 7)
#> An `epi_df` object, 0 x 4 with metadata:
#> * geo_type = custom
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 0 × 4
#> # ℹ 4 variables: geo_value <chr>, time_value <date>, cases <dbl>,
#> # cases_7dav <dbl>
xt_filled %>%
as_epi_df(as_of = as.Date("2024-03-20")) %>%
group_by(geo_value) %>%
epi_slide(cases_7dav = mean(cases), .window_size = 7) %>%
ungroup() %>%
filter(
geo_value == "Plymouth, MA",
abs(time_value - as.Date("2021-07-01")) <= 3
) %>%
print(n = 7)
#> An `epi_df` object, 0 x 4 with metadata:
#> * geo_type = custom
#> * time_type = day
#> * as_of = 2024-03-20
#>
#> # A tibble: 0 × 4
#> # ℹ 4 variables: geo_value <chr>, time_value <date>, cases <dbl>,
#> # cases_7dav <dbl>
Geographic aggregation
We do not yet provide tools for geographic aggregation in
epiprocess
. However, we have some Python geocoding
utilities available. Reach out to us if this is functionality you would
like to see us add to epiprocess
.
Attribution
The percent_cli
data is a modified part of the COVIDcast
Epidata API Doctor Visits data. This dataset is licensed under the
terms of the Creative Commons
Attribution 4.0 International license. Copyright Delphi Research
Group at Carnegie Mellon University 2020.
This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
From the COVIDcast Epidata API: These signals are taken directly from the JHU CSSE COVID-19 GitHub repository without changes.