Get predictions — get_predictions • evalcast

For each of the provided forecast dates, runs a forecaster using the data that would have been available as of that given forecast date. Returns a list of "predictions cards", where each list element corresponds to a different forecast date. A predictions card is a data frame giving the forecast distributions of a given forecaster for a given forecast task. A forecast task is specified by the forecast date, ahead, response, incidence period, and geo type (e.g., 1-epiweek-ahead death forecasting at the state level with predictions made using the information as of September 14).

get_predictions(
  forecaster,
  name_of_forecaster,
  signals,
  forecast_dates,
  incidence_period = c("epiweek", "day"),
  apply_corrections = function(signals) signals,
  response_data_source = signals$data_source[1],
  response_data_signal = signals$signal[1],
  forecaster_args = list()
)

Arguments

forecaster	Function that outputs a tibble with columns `ahead`, `geo_value`, `quantile`, and `value`. The `quantile` column gives the probabilities associated with quantile forecasts for that location and ahead. If your forecaster produces point forecasts, then set `quantile = NA`. One argument to `forecaster` must be named `df_list`. It will be populated with the list of historical data returned by a call to COVIDcast. The list will have the same length as the number of rows in the `signals` tibble (see below). The forecaster will also receive a single `forecast_date` as a named argument. Any additional named arguments can be passed via the `forecaster_args` argument below. Thus, the forecaster should have a signature like `forecaster(df_list = data, forecast_date = forecast_date, ...)`.
name_of_forecaster	String indicating name of the forecaster.
signals	Tibble with mandatory columns `data_source` and `signal` and optional columns `start_day`, `as_of`, `geo_type`, and `geo_values`. `data_source` and `signal` specify which variables from the COVIDcast API will be used by `forecaster`. Each row of `signals` represents a separate signal, and the first row is taken to be the response unless explicitly overridden. If using `incidence_period = "epiweek"`, the response should be something for which summing daily values over an epiweek makes sense (e.g., counts or proportions but not log(counts) or log(proportions)). Available data sources and signals are documented in the COVIDcast signal documentation. A few optional columns are also allowed. If not specified, these will default to the values of the similarly named argument. A column `start_day` can be included. This can be a Date object or a string in the form "YYYY-MM-DD", indicating the earliest date of data needed from that data source. Importantly, `start_day` can also be a function (represented as a list column) that takes a forecast date and returns a start date for model training (again, a Date object or a string in the form "YYYY-MM-DD"). The latter is useful when the start date should be computed dynamically from the forecast date (e.g., when `forecaster` only trains on the most recent 4 weeks of data). You may also include a `geo_type` column, a `geo_values` column, and/or an `as_of` column. The first two should contain a string. If unspecified, these will have the same defaults as `covidcast::covidcast_signal()`, namely `geo_type = "county"` and `geo_values = "*"`. These arguments allow you to download different data than what you're actually trying to predict, say using state-level data to predict national outcomes. By default, data is downloaded from COVIDcast with `as_of = forecast_date`. This means that the data is "rewound" to its state on that past date: any revisions made since then would not have been present at that time, and so are not available to the forecaster. It's likely, for example, that no data would actually exist for the forecast date on the forecast date (there is some latency between the time signals are reported and the dates for which they are reported). You can override this behavior, though we strongly advise you do so with care, by passing a function of the forecast date or a single date in the `as_of` column. The function should return a Date.
forecast_dates	Vector of Date objects (or strings of the form "YYYY-MM-DD") indicating dates on which forecasts will be made.
incidence_period	String indicating the incidence period, either "epiweek" or "day".
apply_corrections	An optional function that applies data corrections to the signals. Its input is a data frame or list, as returned by `df <- covidcast::download_signals()`. The returned object should be of the type expected by your forecaster. This function will be called as `apply_corrections(df)`.
response_data_source	String indicating the `data_source` of the response. This is used mainly for downstream evaluation. By default, this will be the same as the `data_source` in the first row of the `signals` tibble.
response_data_signal	String indicating the `signal` of the response. This is used mainly for downstream evaluation. By default, this will be the same as the `signal` in the first row in the `signals` tibble.
forecaster_args	A list of additional named arguments to be passed to `forecaster()`. A common use case is to pass the periods ahead (e.g., predict 1 day, 2 days, ..., k days ahead). Note that `ahead` is a required component of the forecaster output (see above).
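
The output format required of `forecaster` can be illustrated with a minimal sketch. The function name `last_value_forecaster` and the toy input are hypothetical and not part of evalcast; the sketch assumes that each element of `df_list` carries `geo_value`, `time_value`, and `value` columns, and it returns a plain data frame that could be wrapped in `tibble::as_tibble()`.

```r
# Hypothetical point forecaster: carries the last observed value forward.
last_value_forecaster <- function(df_list, forecast_date, ahead = 1:2) {
  # Assumption: the first element holds the response, with columns
  # geo_value, time_value, and value.
  response <- df_list[[1]]
  response <- response[order(response$geo_value, response$time_value), ]
  # Keep the most recent row for each location.
  latest <- response[!duplicated(response$geo_value, fromLast = TRUE), ]
  # One row per (ahead, geo_value) pair.
  out <- expand.grid(ahead = ahead, geo_value = latest$geo_value,
                     stringsAsFactors = FALSE)
  # quantile = NA marks a point forecast.
  out$quantile <- NA_real_
  out$value <- latest$value[match(out$geo_value, latest$geo_value)]
  out
}

# Toy input standing in for the data COVIDcast would supply.
toy <- data.frame(
  geo_value = c("mi", "mi", "oh", "oh"),
  time_value = as.Date(c("2020-09-29", "2020-09-30",
                         "2020-09-29", "2020-09-30")),
  value = c(10, 12, 7, 5)
)
toy_forecast <- last_value_forecaster(list(toy), as.Date("2020-10-01"))
```

In this sketch `ahead` would be supplied through `forecaster_args`, for example `forecaster_args = list(ahead = 1:4)`.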

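As an illustration of the kind of function `apply_corrections` accepts, the sketch below replaces negative values with `NA`. The function name `drop_negative_values` and the correction rule itself are hypothetical; the only assumptions taken from the argument description are that the input is a data frame or a list of data frames, and the sketch further assumes each has a `value` column.

```r
# Hypothetical correction: treat negative incidence values as missing.
drop_negative_values <- function(signals) {
  fix_one <- function(df) {
    # Assumption: each data frame has a numeric `value` column.
    df$value[!is.na(df$value) & df$value < 0] <- NA_real_
    df
  }
  # Handle both a single data frame and a list of data frames.
  if (is.data.frame(signals)) fix_one(signals) else lapply(signals, fix_one)
}

# Toy data standing in for a downloaded signal.
raw <- data.frame(geo_value = c("mi", "mi", "mi"), value = c(3, -2, 4))
corrected <- drop_negative_values(raw)
corrected_list <- drop_negative_values(list(raw, raw))
```

Such a function would be passed as `apply_corrections = drop_negative_values`; the returned object keeps the same shape as the input, so the forecaster sees the type it expects.
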
Value

Long data frame of forecasts with a class of predictions_cards. The first 4 columns are the same as those returned by the forecaster. The remainder specify the prediction task, 10 columns in total: ahead, geo_value, quantile, value, forecaster, forecast_date, data_source, signal, target_end_date, and incidence_period. Here data_source and signal correspond to the response variable only.

Examples

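if (FALSE) {
# Requires the evalcast package and access to the COVIDcast API.
library(evalcast)

# A single signal, which is also the response: JHU-CSSE death incidence
# for Michigan, with data starting on 2020-08-15.
signals <- tibble::tibble(
  data_source = "jhu-csse",
  signal = "deaths_incidence_num",
  start_day = "2020-08-15",
  geo_values = "mi",
  geo_type = "state"
)

# Forecast 1 to 4 epiweeks ahead, using data as of 2020-10-01.
baby_predictions <- get_predictions(
  baseline_forecaster, "baby",
  signals,
  forecast_dates = "2020-10-01",
  incidence_period = "epiweek",
  forecaster_args = list(
    incidence_period = "epiweek",
    ahead = 1:4
  )
)
}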
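
The `signals` argument also allows `start_day` to be a function stored in a list column. The following is a minimal sketch of such a tibble, assuming the tibble package is installed; the names `four_weeks_back` and `example_signals` are hypothetical, and the second signal (`confirmed_incidence_num`) is included only to show a multi-row `signals` tibble.

```r
# Hypothetical helper: start training data 28 days before the forecast date.
four_weeks_back <- function(forecast_date) {
  as.Date(forecast_date) - 28
}

# Two signals; the first row is treated as the response by default.
# The length-one list is recycled so each row gets the same function.
example_signals <- tibble::tibble(
  data_source = "jhu-csse",
  signal = c("deaths_incidence_num", "confirmed_incidence_num"),
  start_day = list(four_weeks_back),
  geo_type = "state",
  geo_values = "mi"
)

# The start date implied for a forecast made on 2020-10-01.
example_start <- example_signals$start_day[[1]]("2020-10-01")
```

Because the start date is computed from each forecast date, the same `signals` tibble can be reused across a vector of `forecast_dates` without downloading more history than the forecaster needs.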