quantgen_forecaster(
  df,
  forecast_date,
  signals,
  incidence_period,
  ahead,
  geo_type,
  n = 4 * ifelse(incidence_period == "day", 7, 1),
  lags = 0,
  tau = modeltools::covidhub_probs,
  transform = NULL,
  inv_trans = NULL,
  featurize = NULL,
  noncross = FALSE,
  noncross_points = c("all", "test", "train"),
  cv_type = c("forward", "random"),
  verbose = FALSE,
  ...
)
Arguments
| df |
Data frame of signal values to use for forecasting, of the format
that is returned by covidcast::covidcast_signals(). |
| forecast_date |
Date object or string of the form "YYYY-MM-DD",
indicating the date on which forecasts will be made. For example, if
forecast_date = "2020-05-11", incidence_period = "day", and ahead = 3, then forecasts would be made for "2020-05-14". |
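The target date in this example is plain date arithmetic. A minimal base R sketch for the daily case only; the helper name target_date is hypothetical and not part of the package:

```r
# Hypothetical helper: date being forecast when incidence_period = "day".
target_date <- function(forecast_date, ahead) {
  as.Date(forecast_date) + ahead
}

# Forecast made on 2020-05-11, three days ahead.
target_date("2020-05-11", ahead = 3)
```

Because ahead can be a vector, the same expression returns one target date per ahead value.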
| signals |
Tibble with columns data_source and signal that specifies
which variables are being fetched from the COVIDcast API, and populated in
df. Each row of signals represents a separate signal, and the first row is
taken to be the response. An optional column start_day can also be
included. This can be a Date object or string in the form "YYYY-MM-DD",
indicating the earliest date of data needed from that data source.
Importantly, start_day can also be a function (represented as a list
column) that takes a forecast date and returns a start date for model
training (again, Date object or string in the form "YYYY-MM-DD"). The
latter is useful when the start date should be computed dynamically from
the forecast date (e.g., when the forecaster only trains on the most recent
4 weeks of data). |
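A rough sketch of a signals table whose start_day is computed from the forecast date. To stay dependency-free it uses a base data frame with a list column as a stand-in for a tibble; the signal names are illustrative choices and recent_4_weeks is a hypothetical helper:

```r
# Illustrative signals; the first row is treated as the response.
signals <- data.frame(
  data_source = c("jhu-csse", "fb-survey"),
  signal = c("deaths_incidence_num", "smoothed_hh_cmnty_cli"),
  stringsAsFactors = FALSE
)

# Hypothetical start_day function: train only on the most recent 4 weeks.
recent_4_weeks <- function(forecast_date) as.Date(forecast_date) - 28

# One function per row, stored as a list column.
signals$start_day <- list(recent_4_weeks, recent_4_weeks)

# Start date implied by a given forecast date.
signals$start_day[[1]]("2020-05-11")
```

With the tibble package, the same layout would be built by passing the list of functions as the start_day column.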
| incidence_period |
One of "day" or "epiweek", indicating the period over
which forecasts are being made. Default is "day". |
| ahead |
Vector of ahead values, indicating how many days/epiweeks ahead
to forecast. If incidence_period = "day", then ahead = 1 means the day
after forecast date. If incidence_period = "epiweek" and the forecast
date falls on a Sunday or Monday, then ahead = 1 means the epiweek that
includes the forecast date; if forecast_date falls on a Tuesday through
Saturday, then it means the following epiweek. |
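The epiweek rule can be sketched as follows, assuming epiweeks run Sunday through Saturday. The helper target_epiweek_start is hypothetical; it only illustrates the rule as described and is not the package's own implementation:

```r
# Hypothetical helper: Sunday that starts the epiweek targeted by `ahead`.
target_epiweek_start <- function(forecast_date, ahead) {
  d <- as.Date(forecast_date)
  wday <- as.POSIXlt(d)$wday   # 0 = Sunday, ..., 6 = Saturday
  week_start <- d - wday       # Sunday of the epiweek containing d
  # Sunday or Monday: ahead = 1 is the current epiweek; otherwise the next one.
  offset <- if (wday <= 1) ahead - 1 else ahead
  week_start + 7 * offset
}

target_epiweek_start("2020-05-11", ahead = 1)  # forecast date is a Monday
target_epiweek_start("2020-05-12", ahead = 1)  # forecast date is a Tuesday
```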
| n |
Size of the local training window (in days/weeks, depending on
incidence_period) to use. For example, if n = 14 and incidence_period = "day", then to make a 1-day-ahead forecast on December 15, we train on
data from December 1 to December 14. |
| lags |
Vector of lag values to use as features in the autoregressive
model. For example, when incidence_period = "day", setting lags = c(0, 7, 14) means we use the current value of each signal (defined by a row of
the signals tibble), as well as the values 7 and 14 days ago, as the
features. Recall that the response is defined by the first row of the
signals tibble. Note that lags can also be a list of vectors of lag
values, this list having the same length as the number of rows of
signals, in order to apply a different set of shifts to each signal.
Default is 0, which means no additional lags (only current values) for each
signal. |
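To make the lags argument concrete, here is a small base R sketch of what lagged features look like for one signal. The helper make_lagged and the data are made up for illustration; the forecaster builds its features internally:

```r
# Hypothetical helper: one column per lag, NA where no lagged value exists.
make_lagged <- function(x, lags) {
  n <- length(x)
  cols <- lapply(lags, function(l) c(rep(NA, l), x)[seq_len(n)])
  names(cols) <- paste0("lag", lags)
  as.data.frame(cols)
}

x <- 1:20  # made-up daily signal values
feats <- make_lagged(x, lags = c(0, 7, 14))
tail(feats, 3)

# Per-signal lags for a two-row signals table: three lags for the first
# signal, current value only for the second.
lags_per_signal <- list(c(0, 7, 14), 0)
```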
| tau |
Vector of quantile levels for the probabilistic forecast. If not
specified, defaults to the levels required by the COVID Forecast Hub. |
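As a sketch, the default is assumed here to be the 23 quantile levels commonly used for COVID Forecast Hub submissions; the exact contents of modeltools::covidhub_probs are an assumption and worth checking against the installed package:

```r
# Assumed default: 23 levels from 0.01 to 0.99.
tau <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
length(tau)

# A coarser custom set of levels could be supplied instead.
tau_custom <- c(0.1, 0.25, 0.5, 0.75, 0.9)
```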
| transform, inv_trans |
Transformation and inverse transformations to use
for the response/features. The former transform can be a function or a
list of functions, this list having the same length as the number of rows
in the signals tibble, in order to apply the same transformation or a
different transformation to each signal. These transformations will be
applied before fitting the quantile model. The latter argument inv_trans
specifies the inverse transformation to use on the response variable
(inverse of transform if this is a function, or of transform[[1]] if
transform is a list), which will be applied post prediction from the
quantile model. Several convenience functions for transformations exist as
part of the quantgen package. Default is NULL for both transform and
inv_trans, which means no transformations are applied. |
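A minimal sketch of a matching transform and inv_trans pair, written in base R. The names log_pad1 and exp_pad1 are hypothetical and defined here only for illustration:

```r
# Hypothetical pair: log with a pad of 1 (safe at zero), and its inverse.
log_pad1 <- function(x) log(x + 1)
exp_pad1 <- function(x) exp(x) - 1

x <- c(0, 3, 50)                     # made-up counts
all.equal(exp_pad1(log_pad1(x)), x)  # the inverse undoes the transform

# Per-signal transforms for a two-row signals table: the response is
# log-transformed, the second signal is left as-is.
transform_list <- list(log_pad1, identity)
```

In this setup inv_trans would be exp_pad1, since it inverts transform_list[[1]], the transformation applied to the response.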
| featurize |
Function to construct custom features before the quantile
model is fit. As input, this function must take a data frame with columns
geo_value, time_value, then the transformed, lagged signal values. This
function must return a data frame with columns geo_value, time_value,
then any custom features. The rows of the returned data frame must not be
reordered. |
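A sketch of a featurize function that respects this contract: it keeps geo_value and time_value first, keeps the existing signal columns, appends one custom feature, and leaves row order untouched. The function name and the toy column names are made up; the real input column names are not assumed:

```r
# Hypothetical featurize function: appends the row-wise mean of all
# signal columns as an extra feature.
add_row_mean <- function(df) {
  signal_cols <- setdiff(names(df), c("geo_value", "time_value"))
  df$row_mean <- rowMeans(df[, signal_cols, drop = FALSE])
  df
}

# Made-up input in the documented layout.
toy <- data.frame(
  geo_value = c("ca", "ca", "ny"),
  time_value = as.Date(c("2020-05-10", "2020-05-11", "2020-05-11")),
  sig1_lag0 = c(1, 2, 3),
  sig1_lag7 = c(3, 4, 5)
)
add_row_mean(toy)
```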
| noncross |
Should noncrossing constraints be applied? These force the
predicted quantiles to be properly ordered across all quantile levels being
considered. The default is FALSE. If TRUE, then noncrossing constraints
are applied to the estimated quantiles at all points specified by the next
argument. |
| noncross_points |
One of "all", "test", or "train", indicating which points
to use for the noncrossing constraints: the default "all" means to use both
training and testing sets combined, while "test" or "train" means to use
just one set, testing or training, respectively. |
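To illustrate what the noncrossing constraints guard against, here is a small check for crossed quantiles. The helper has_crossing and the numbers are made up; this only detects crossings and does not reproduce the constrained fit:

```r
# TRUE if any row has a higher-level quantile below a lower-level one.
# `q` has one row per prediction point and one column per quantile level,
# with columns ordered by increasing level.
has_crossing <- function(q) {
  any(apply(q, 1, is.unsorted))
}

q <- rbind(
  c(1.0, 2.0, 3.0),  # properly ordered
  c(1.5, 1.2, 2.5)   # crossing: second quantile falls below the first
)
has_crossing(q)
```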
| cv_type |
One of "forward" or "random", indicating the type of
cross-validation to perform. If "random", then nfolds folds are chosen by
dividing training data points randomly (the default being nfolds = 5). If
"forward", the default, then we instead use a "forward-validation" approach
that better reflects the way predictions are made in the current time
series forecasting context. Roughly, this works as follows: the data points
from the first n - nfolds time values are used for model training, and
then predictions are made at the earliest possible forecast date after this
training period. We march forward one time point at a time and repeat. In
either case ("random" or "forward"), the loss function used for computing
validation error is quantile regression loss (read the documentation for
quantgen::cv_quantile_lasso() for more details); and the final quantile
model is refit on the full training set using the validation-optimal tuning
parameter. |
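The forward-validation scheme and the validation loss can be sketched roughly as below. Both helpers are hypothetical, and the split assumes predictions one time step ahead; the actual fold construction is handled by quantgen::cv_quantile_lasso():

```r
# Quantile (pinball) loss of predictions `q` for observations `y` at level `tau`.
pinball_loss <- function(y, q, tau) {
  mean(pmax(tau * (y - q), (tau - 1) * (y - q)))
}

# Hypothetical helper: forward-validation splits over n time values.
# Fold k trains on the first (n - nfolds + k - 1) time values and
# validates on the next one.
forward_folds <- function(n, nfolds = 5) {
  lapply(seq_len(nfolds), function(k) {
    n_train <- n - nfolds + k - 1
    list(train = seq_len(n_train), validate = n_train + 1)
  })
}

folds <- forward_folds(n = 28, nfolds = 5)
folds[[1]]$validate  # first validation point follows 23 training values
pinball_loss(y = c(10, 12), q = c(9, 13), tau = 0.5)
```

Unlike random folds, every validation point here lies after all of its training points, which mirrors how the fitted model is used at forecast time.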
| verbose |
Should progress be printed out to the console? Default is
FALSE. |
| ... |
Additional arguments. Any parameter accepted by
quantgen::cv_quantile_lasso() (for model training) or by
quantgen:::predict.cv_quantile_genlasso() (for model prediction) can be
passed here. For example, nfolds, for specifying the number of folds used
in cross-validation, or lambda, for specifying the tuning parameter
values over which to perform cross-validation (the default allows
quantgen::cv_quantile_lasso() to set the lambda sequence itself). Note
that fixing a single tuning parameter value (such as lambda = 0)
effectively disables cross-validation and fits a quantile model at the
given tuning parameter value (here unregularized quantile autoregression). |
Value
Data frame with columns ahead, geo_value, quantile, and
value. The quantile column gives the probabilities associated with
quantile forecasts for that location and ahead.
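A sketch of working with a result in this layout. The data frame below is made up to mimic the documented columns; it is not real forecaster output:

```r
# Made-up values, laid out like the documented return value.
preds <- data.frame(
  ahead = 1,
  geo_value = rep(c("ca", "ny"), each = 3),
  quantile = rep(c(0.1, 0.5, 0.9), times = 2),
  value = c(8, 12, 19, 20, 27, 41)
)

# Compare quantile levels with a tolerance rather than exact equality.
near <- function(a, b) abs(a - b) < 1e-8

# Point forecasts: the median (quantile level 0.5) per location.
medians <- preds[near(preds$quantile, 0.5), c("geo_value", "ahead", "value")]

# 80% prediction interval per location from the 0.1 and 0.9 quantiles.
intervals <- data.frame(
  geo_value = unique(preds$geo_value),
  lower = preds$value[near(preds$quantile, 0.1)],
  upper = preds$value[near(preds$quantile, 0.9)]
)
```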