quantgen_forecaster(
df,
forecast_date,
signals,
incidence_period,
ahead,
geo_type,
n = 4 * ifelse(incidence_period == "day", 7, 1),
lags = 0,
tau = modeltools::covidhub_probs,
transform = NULL,
inv_trans = NULL,
featurize = NULL,
noncross = FALSE,
noncross_points = c("all", "test", "train"),
cv_type = c("forward", "random"),
verbose = FALSE,
...
)
Arguments
df |
Data frame of signal values to use for forecasting, of the format
that is returned by covidcast::covidcast_signals() . |
forecast_date |
Date object or string of the form "YYYY-MM-DD",
indicating the date on which forecasts will be made. For example, if
forecast_date = "2020-05-11" , incidence_period = "day" , and ahead = 3 , then, forecasts would be made for "2020-05-14". |
signals |
Tibble with columns data_source and signal that specifies
which variables are being fetched from the COVIDcast API, and populated in
df . Each row of signals represents a separate signal, and first row is
taken to be the response. An optional column start_day can also be
included. This can be a Date object or string in the form "YYYY-MM-DD",
indicating the earliest date of data needed from that data source.
Importantly, start_day can also be a function (represented as a list
column) that takes a forecast date and returns a start date for model
training (again, Date object or string in the form "YYYY-MM-DD"). The
latter is useful when the start date should be computed dynamically from
the forecast date (e.g., when the forecaster only trains on the most recent
4 weeks of data). |
incidence_period |
One of "day or "epiweek", indicating the period over
which forecasts are being made. Default is "day". |
ahead |
Vector of ahead values, indicating how many days/epiweeks ahead
to forecast. If incidence_period = "day" , then ahead = 1 means the day
after forecast date. If incidence_period = "epiweek" and the forecast
date falls on a Sunday or Monday, then ahead = 1 means the epiweek that
includes the forecast date; if forecast_date falls on a Tuesday through
Saturday, then it means the following epiweek. |
n |
Size of the local training window (in days/weeks, depending on
incidence_period ) to use. For example, if n = 14 , and incidence_period = "day" , then to make a 1-day-ahead forecast on December 15, we train on
data from November 1 to November 14. |
lags |
Vector of lag values to use as features in the autoregressive
model. For example, when incidence_period = "day" , setting lags = c(0, 7, 14) means we use the current value of each signal (defined by a row of
the signals tibble), as well as the values 7 and 14 days ago, as the
features. Recall that the response is defined by the first row of the
signals tibble. Note that lags can also be a list of vectors of lag
values, this list having the same length as the number of rows of
signals , in order to apply a different set of shifts to each signal.
Default is 0, which means no additional lags (only current values) for each
signal. |
tau |
Vector of quantile levels for the probabilistic forecast. If not
specified, defaults to the levels required by the COVID Forecast Hub. |
transform, inv_trans |
Transformation and inverse transformations to use
for the response/features. The former transform can be a function or a
list of functions, this list having the same length as the number of rows
in the signals tibble, in order to apply the same transformation or a
different transformation to each signal. These transformations will be
applied before fitting the quantile model. The latter argument inv_trans
specifies the inverse transformation to use on the response variable
(inverse of transform if this is a function, or of transform[[1]] if
transform is a list), which will be applied post prediction from the
quantile model. Several convenience functions for transformations exist as
part of the quantgen package. Default is NULL for both transform and
inv_trans , which means no transformations are applied. |
featurize |
Function to construct custom features before the quantile
model is fit. As input, this function must take a data frame with columns
geo_value , time_value , then the transformed, lagged signal values. This
function must return a data frame with columns geo_value , time_value ,
then any custom features. The rows of the returned data frame must not be
reordered. |
noncross |
Should noncrossing constraints be applied? These force the
predicted quantiles to be properly ordered across all quantile levels being
considered. The default is FALSE . If TRUE , then noncrossing constraints
are applied to the estimated quantiles at all points specified by the next
argument. |
noncross_points |
One of "all", "test", "train" indicating which points
to use for the noncrossing constraints: the default "all" means to use both
training and testing sets combined, while "test" or "train" means to use
just one set, training or testing, respectively. |
cv_type |
One of "forward" or "random", indicating the type of
cross-validation to perform. If "random", then nfolds folds are chosen by
dividing training data points randomly (the default being nfolds = 5 ). If
"forward", the default, then we instead use a "forward-validation" approach
that better reflects the way predictions are made in the current time
series forecasting context. Roughly, this works as follows: the data points
from the first n - nfolds time values are used for model training, and
then predictions are made at the earliest possible forecast date after this
training period. We march forward one time point at a time and repeat. In
either case ("random" or "forward"), the loss function used for computing
validation error is quantile regression loss (read the documentation for
quantgen::cv_quantile_lasso() for more details); and the final quantile
model is refit on the full training set using the validation-optimal tuning
parameter. |
verbose |
Should progress be printed out to the console? Default is
FALSE . |
... |
Additional arguments. Any parameter accepted by
quantgen::cv_quantile_lasso() (for model training) or by
quantgen:::predict.cv_quantile_genlasso() (for model prediction) can be
passed here. For example, nfolds , for specifying the number of folds used
in cross-validation, or lambda , for specifying the tuning parameter
values over which to perform cross-validation (the default allows
quantgen::cv_quantile_lasso() to set the lambda sequence itself). Note
that fixing a single tuning parameter value (such as lambda = 0 )
effectively disables cross-validation and fits a quantile model at the
given tuning parameter value (here unregularized quantile autoregression). |
Value
Data frame with columns ahead
, geo_value
, quantile
, and
value
. The quantile
column gives the probabilities associated with
quantile forecasts for that location and ahead.