For each of the provided forecast dates, runs a forecaster using the data
that would have been available as of that given forecast date. Returns a list
of "predictions cards", where each list element corresponds to a different
forecast date. A predictions card is a data frame giving the forecast
distributions of a given forecaster for a given forecast task. A forecast
task is specified by the forecast date, ahead, response, incidence period,
and geo type (e.g., 1-epiweek-ahead death forecasting at the state level with
predictions made using the information as of September 14).
get_predictions(
  forecaster,
  name_of_forecaster,
  signals,
  forecast_dates,
  incidence_period = c("epiweek", "day"),
  apply_corrections = function(signals) signals,
  response_data_source = signals$data_source[1],
  response_data_signal = signals$signal[1],
  forecaster_args = list()
)
Arguments
forecaster |
Function that outputs a tibble with columns ahead,
geo_value, quantile, and value. The quantile column gives the
probabilities associated with quantile forecasts for that location and
ahead. If your forecaster produces point forecasts, then set quantile=NA.
One argument to forecaster must be named df_list. It will be
populated with the list of historical data returned by a call
to COVIDcast. The list will have one element for each row of
the signals tibble (see below).
The forecaster will also receive a single forecast_date as a named argument.
Any additional named arguments can be passed via the forecaster_args
argument below.
Thus, the forecaster should have a signature like
forecaster(df_list = data, forecast_date = forecast_date, ...) |
name_of_forecaster |
String indicating name of the forecaster. |
signals |
Tibble with mandatory columns data_source and signal and
optional columns start_day, as_of, geo_type, and geo_values.
data_source and signal specify which variables from the COVIDcast API
will be used by forecaster. Each
row of signals represents a separate signal, and the first row is taken to be
the response unless explicitly overridden.
If using incidence_period = "epiweek", the response should
be something for which summing daily values over an epiweek makes sense
(e.g., counts or proportions but not log(counts) or log(proportions)).
Available data sources and signals are documented in the COVIDcast signal documentation.
A few optional columns are also allowed. If not specified, these will default
to the values of the similarly named argument.
A column start_day can be included. This can be a Date
object or string in the form "YYYY-MM-DD", indicating the earliest date of
data needed from that data source. Importantly, start_day can also be a
function (represented as a list column) that takes a forecast date and
returns a start date for model training (again, Date object or string in
the form "YYYY-MM-DD"). The latter is useful when the start date should be
computed dynamically from the forecast date (e.g., when forecaster only
trains on the most recent 4 weeks of data).
You may also include a geo_type column, a geo_values column, and/or an
as_of column.
The first two should contain a string. If unspecified, these will have
the same defaults as covidcast::covidcast_signal(), namely
geo_type = "county" and geo_values = "*".
These columns let you download data at a different geographic level than
the one you are actually trying to predict, for example using state-level
data to predict national outcomes.
By default, data is downloaded from COVIDcast with
as_of = forecast_date. This means that the data is "rewound" to what was
available on that past date: any data revisions made since then would
not have been present at that time, and are not available to the
forecaster. It's likely, for example, that no data would actually exist
for the forecast date on the forecast date (there is some latency between
the time signals are reported and the dates for which they are reported).
You can override this behavior, though we strongly advise you do so
with care, by passing a function of the forecast_date or a single date
here. The function should return a Date. |
forecast_dates |
Vector of Date objects (or strings of the form
"YYYY-MM-DD") indicating dates on which forecasts will be made. |
incidence_period |
String indicating the incidence period, either
"epiweek" or "day". |
apply_corrections |
An optional function that applies data corrections
to the signals. Input is a data frame or list as returned by
df <- covidcast::download_signals().
The returned object should be of the type expected by your forecaster.
This function will be called as apply_corrections(df). |
response_data_source |
String indicating the data_source of the response.
This is used mainly for downstream evaluation. By default, this will be the
same as the data_source in the first row of the signals tibble. |
response_data_signal |
String indicating the signal of the response.
This is used mainly for downstream evaluation. By default, this will be the
same as the signal in the first row in the signals tibble. |
forecaster_args |
A list of additional named arguments to be passed
to forecaster(). A common use case is to pass the forecast horizons as ahead
(e.g., predict 1 day, 2 days, ..., k days ahead). Note that ahead is a
required component of the forecaster output (see above). |
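The forecaster contract described above can be sketched in base R. This is a minimal illustration, not a forecaster shipped with the package: the name flatline_forecaster is hypothetical, and it assumes each element of df_list is a data frame with columns geo_value, time_value, and value. It returns a plain data frame; wrap the result in tibble::as_tibble() if a tibble is strictly required.

```r
# Hypothetical "flatline" forecaster: repeats the latest observed value.
# Assumption: each element of df_list is a data frame with columns
# geo_value, time_value, and value.
flatline_forecaster <- function(df_list, forecast_date, ahead = 1, ...) {
  response <- df_list[[1]]  # first signal is treated as the response
  # Drop anything dated after the forecast date, then sort by date.
  response <- response[as.Date(response$time_value) <= as.Date(forecast_date), ]
  response <- response[order(response$time_value), ]
  # Most recent value per location.
  latest <- tapply(response$value, response$geo_value,
                   function(v) v[length(v)])
  # One row per (ahead, geo_value); quantile = NA marks a point forecast.
  out <- expand.grid(ahead = ahead, geo_value = names(latest),
                     stringsAsFactors = FALSE)
  out$quantile <- NA_real_
  out$value <- as.numeric(latest[out$geo_value])
  out
}

# Toy data standing in for what the download step would supply.
toy <- data.frame(
  geo_value = c("ca", "ca", "ny", "ny"),
  time_value = as.Date(c("2020-09-12", "2020-09-13",
                         "2020-09-12", "2020-09-13")),
  value = c(10, 12, 7, 5),
  stringsAsFactors = FALSE
)
preds <- flatline_forecaster(list(toy), forecast_date = "2020-09-14",
                             ahead = 1:2)
```

Because ahead has a default and the function accepts ..., extra entries supplied through forecaster_args (for example list(ahead = 1:4)) would be received as ordinary named arguments.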
Value
Long data frame of forecasts with a class of predictions_cards.
The first 4 columns are the same as those returned by the forecaster. The
remainder specify the prediction task, 10 columns in total:
ahead, geo_value, quantile, value, forecaster, forecast_date,
data_source, signal, target_end_date, and incidence_period. Here
data_source and signal correspond to the response variable only.
Examples
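The following is a hedged sketch of how the pieces might fit together, not an official example. The data source and signal names are illustrative and should be checked against the COVIDcast signal documentation; my_forecaster is a hypothetical placeholder; and the call itself is wrapped in if (FALSE) because it needs network access and assumes the package providing get_predictions() is attached.

```r
# Signals table: the first row is the response, the second a covariate.
# Source and signal names here are illustrative assumptions.
signals <- data.frame(
  data_source = c("jhu-csse", "fb-survey"),
  signal = c("deaths_incidence_num", "smoothed_hh_cmnty_cli"),
  geo_type = "state",
  stringsAsFactors = FALSE
)

# start_day as a function of the forecast date, stored in a list column:
# here, train on only the 28 days before each forecast date.
four_weeks_back <- function(forecast_date) as.Date(forecast_date) - 28
signals$start_day <- rep(list(four_weeks_back), nrow(signals))

# Hypothetical placeholder forecaster with the required signature;
# it returns a constant point forecast for every location.
my_forecaster <- function(df_list, forecast_date, ahead = 1, ...) {
  geos <- unique(df_list[[1]]$geo_value)
  out <- expand.grid(ahead = ahead, geo_value = geos,
                     stringsAsFactors = FALSE)
  out$quantile <- NA_real_
  out$value <- 0
  out
}

# Not run: requires network access and the package to be attached.
if (FALSE) {
  cards <- get_predictions(
    forecaster = my_forecaster,
    name_of_forecaster = "placeholder",
    signals = tibble::as_tibble(signals),
    forecast_dates = c("2020-09-14", "2020-09-21"),
    incidence_period = "epiweek",
    forecaster_args = list(ahead = 1:4)
  )
}
```

Storing start_day as a function lets the training window move with each forecast date instead of being fixed once for all dates.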