A function to describe revision behavior for an archive.
Source: R/revision_analysis.R
revision_summary removes all missing values (if requested), and
then computes some basic statistics about the revision behavior
of an archive, returning a tibble summarizing the revisions per
time_value+epi_key features (limited to those that have data
available past the min waiting period to compare against, and are
not detected as added in a bulk report). The columns returned
are:
- n_revisions: the total number of revisions for that entry
- min_lag: the minimum time to any value (if drop_nas=FALSE, this includes NA's)
- max_lag: the amount of time until the final (new) version (same caveat for drop_nas=FALSE, though it is far less likely to matter)
- min_value: the minimum value across revisions
- max_value: the maximum value across revisions
- median_value: the median value across revisions
- spread: the difference between the smallest and largest values (this always excludes NA values)
- rel_spread: spread divided by the largest value (so it will be at most 1 when values are non-negative). Note that this need not be the final value. It will be NA whenever spread is 0.
- lag_near_latest: the time taken for the revisions to settle to within within_latest (default 20%) of the final value and stay there. For example, consider the series (0, 20, 99, 150, 102, 100); then lag_near_latest is 5, since even though 99 is within 20%, it is outside the window afterwards at 150.
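The lag_near_latest example above can be reproduced with a short base R sketch. The helper name lag_near_latest_sketch is hypothetical and not part of the package. The sketch assumes that revision k arrives at lag k, and that "within" means an absolute difference of at most within_latest times the final value. The package's internal computation may differ.

```r
# Hypothetical helper (not part of epiprocess): first lag after which every
# remaining revision stays within `within_latest` of the final value.
lag_near_latest_sketch <- function(values, lags, within_latest = 0.2) {
  latest <- values[length(values)]
  near_latest <- abs(values - latest) <= within_latest * abs(latest)
  # Positions of revisions that fall outside the window.
  outside <- which(!near_latest)
  # The series has settled right after the last revision outside the window.
  settle_index <- if (length(outside) == 0) 1L else max(outside) + 1L
  lags[settle_index]
}

values <- c(0, 20, 99, 150, 102, 100)
lags <- seq_along(values) # assumption: revision k arrives at lag k

lag_near_latest <- lag_near_latest_sketch(values, lags)
spread <- max(values) - min(values)
print(lag_near_latest)
print(spread)
```

Here 99 is inside the window but is followed by 150, so the series only settles at the fifth revision (102).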
Usage
revision_analysis(
  epi_arch,
  ...,
  min_waiting_period = as.difftime(60, units = "days"),
  within_latest = 0.2,
  bulk_reporting_level = 0.8,
  bulk_reporting_multiplier = 1.2,
  compactify = TRUE,
  compactify_abs_tol = 0,
  compactify_drop_initial_nas = TRUE,
  drop_nas = FALSE,
  return_only_tibble = FALSE
)
# S3 method for class 'revision_analysis'
print(
  x,
  quick_revision = as.difftime(3, units = "days"),
  few_revisions = 3,
  abs_spread_threshold = NULL,
  rel_spread_threshold = 0.1,
  ...
)
revision_summary(
  epi_arch,
  ...,
  min_waiting_period = as.difftime(60, units = "days"),
  within_latest = 0.2,
  bulk_reporting_level = 0.8,
  bulk_reporting_multiplier = 1.2,
  compactify = TRUE,
  compactify_abs_tol = 0,
  compactify_drop_initial_nas = TRUE,
  drop_nas = FALSE,
  return_only_tibble = FALSE
)
Arguments
- epi_arch
an epi_archive to be analyzed
- ...
<tidyselect>, used to choose the column to summarize. If empty and there is only one value/measurement column (i.e., not in key_colnames) in the archive, it will automatically select it. If supplied, ... must select exactly one column.
- min_waiting_period
difftime, integer or NULL. Sets a cutoff: any time_values that have not had at least min_waiting_period to stabilize as of the versions_end are removed. min_waiting_period should characterize the typical time during which most significant revisions occur. The default of 60 days corresponds to a typical near-final value for case counts as reported in the context of insurance. To avoid this filtering, set it to either NULL or 0. A difftime will be rounded up to the appropriate time_type if necessary (that is, 5 days will be rounded up to 1 week if the data is weekly).
- within_latest
double between 0 and 1. Determines the threshold used for the lag_to
- bulk_reporting_level, bulk_reporting_multiplier
numeric; the former between 0 and 1, typically close to but less than one, and the latter >= 1; defaults of 0.8 and 1.2, respectively. Determines how to detect bulk reporting. Consider the distribution of "max initial lags" across geodemographic group x version pairs that add initial observations for new time values; a bulk reporting lag threshold is determined by taking the bulk_reporting_level-th quantile of this distribution, multiplying by bulk_reporting_multiplier, and rounding to an integer number of time intervals. To avoid flagging anything as bulk reporting, set bulk_reporting_level = 1.
- compactify
bool. If TRUE, we will compactify after the signal requested in ... has been selected on its own and after the drop_nas step. This helps, for example, to give similar results when called on merged and single-signal archives, since merged archives record an update when any of the other signals change, not just the requested signal. The default is TRUE.
- compactify_abs_tol
length-1 double. Used if compactify is TRUE; it determines the threshold for when two doubles are considered identical.
- compactify_drop_initial_nas
bool; should we drop initial estimates of NA during the compactification step? Default is TRUE, because these NAs are likely present due to epix_merge()ing with a more timely indicator, and the upstream source probably didn't actually report explicit NAs as provisional estimates.
- drop_nas
bool; do we drop all NA values (not just initial estimates of NA) from the archive data structure prior to (optional) compactification? This strips both (i) initial estimates of NA and (ii) explicit revisions from a non-NA estimate to NA. Equivalent to compactify_drop_initial_nas if compactify is TRUE. Default is FALSE, favoring compactify = TRUE, compactify_drop_initial_nas = TRUE.
- return_only_tibble
boolean to return only the simple tibble of computational results rather than the complete S3 object.
- x
a revision_analysis object
- quick_revision
Difftime or integer (integer is treated as days). The amount of time between the final revision and the actual time_value to consider the revision quickly resolved. Default of 3 days. This will be rounded up to the appropriate time_type if necessary (that is, 5 days will be rounded up to 1 week if the data is weekly).
- few_revisions
Integer. The upper bound on the number of revisions to consider "few". Default is 3.
- abs_spread_threshold
Scalar numeric. The maximum spread used to characterize revisions which don't actually change very much. Default is 5% of the maximum value in the dataset, but this is the most unit-dependent of the values, and likely needs to be chosen appropriately for the scale of the dataset.
- rel_spread_threshold
Scalar between 0 and 1. The relative spread fraction used to characterize revisions which don't actually change very much. Default is 0.1, or 10% of the final value.
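The bulk reporting threshold described under bulk_reporting_level and bulk_reporting_multiplier can be illustrated with a rough base R sketch. The lag values are made up for illustration. The use of round() for the "rounding to an integer number of time intervals" step is an assumption, since the exact rounding rule is not stated here.

```r
# Hypothetical "max initial lags" in days, one per (group, version) pair that
# adds initial observations for new time values. Illustrative values only.
max_initial_lags <- c(3, 3, 3, 4, 3, 3, 12, 3, 4, 3)

bulk_reporting_level <- 0.8
bulk_reporting_multiplier <- 1.2

# Quantile of the lag distribution, scaled by the multiplier.
raw_threshold <- unname(quantile(max_initial_lags, probs = bulk_reporting_level)) *
  bulk_reporting_multiplier

# Assumption: round() stands in for the package's integer rounding step.
bulk_lag_threshold <- round(raw_threshold)

# In this sketch, initial lags above the threshold count as bulk reporting.
is_bulk <- max_initial_lags > bulk_lag_threshold
print(bulk_lag_threshold)
print(max_initial_lags[is_bulk])
```

With bulk_reporting_level = 1 the quantile becomes the largest observed lag, so after multiplying by a value >= 1 nothing in this toy data would exceed the threshold.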
Value
An S3 object with class revision_analysis. This function is typically
called for the purposes of inspecting the printed output. The
results of the computations are available in
revision_analysis(...)$revision_behavior. If you only want to access
the internal computations, use return_only_tibble = TRUE.
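As a minimal sketch of the two ways to reach the results, using only the arguments documented above: the first call keeps the full S3 object and extracts its revision_behavior component, and the second asks for the tibble directly. The guard is only there so the snippet is skipped when epiprocess is not installed.

```r
# Skip the sketch entirely when epiprocess is not available.
if (requireNamespace("epiprocess", quietly = TRUE)) {
  library(epiprocess)

  # Full S3 object; printing it shows the summary report.
  full_analysis <- revision_analysis(archive_cases_dv_subset, percent_cli)
  behavior_from_object <- full_analysis$revision_behavior

  # Tibble only, without the S3 wrapper.
  behavior_tbl <- revision_analysis(
    archive_cases_dv_subset, percent_cli,
    return_only_tibble = TRUE
  )

  # Column names such as n_revisions, spread, and rel_spread.
  print(names(behavior_tbl))
}
```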
Details
Applies to epi_archives with time_types of "day", "week",
and "yearmonth". It can also work with a time_type of "integer" if
the possible time_values are all consecutive integers; you will need to
manually specify the min_waiting_period and quick_revision, though.
Using a time_type of "integer" with week numbers like 202501 will
produce incorrect results for some calculations, since week numbering
contains jumps at year boundaries.
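The year-boundary problem with integer week numbers can be seen in a small base R sketch. The week labels are hypothetical. The re-indexing step is one possible workaround rather than a package recommendation, and it assumes that no weeks are missing from the data.

```r
# Hypothetical week labels encoded as integers (YYYYWW).
week_labels <- c(202451L, 202452L, 202501L, 202502L)

# Read as time steps, consecutive weeks should differ by exactly 1.
label_steps <- diff(week_labels)
print(label_steps) # the step across the year boundary is 49, not 1

# Possible workaround (an assumption, valid only if no weeks are missing):
# map the distinct labels onto consecutive integers before building the archive.
week_index <- match(week_labels, sort(unique(week_labels)))
index_steps <- diff(week_index)
print(index_steps)
```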
Examples
# Print revision summary:
revision_analysis(archive_cases_dv_subset, percent_cli)
#>
#> ── Revision analysis for archive spanning time values 2020-06-01 to 2021-11-30. ──
#>
#> ── Across epi_key + versions that add new time values:
#> Freshest new time value's lag/latency:
#> min median mean max
#> 3 days 3 days 3.1 days 4 days
#> Farthest-back new time value's lag/latency:
#> min median mean max
#> 3 days 3 days 3.3 days 12 days
#>
#> ── Across epi_key + time_value + versions:
#> Fraction of all versions that are `NA`:
#> • 0 out of 112,360 (0%)
#>
#> ── Bulk reporting adding initial observations for older epikey + time values:
#> Initial lags above 5 days were counted as bulk reporting.
#> Fraction of epi_key + time_values initially added by bulk reporting:
#> • 116 out of 1,956 (5.93%)
#> Versions containing bulk reporting: 10
#> • (2020-08-03, 2020-08-16, 2020-08-23, 2020-09-25, 2021-02-22, 2021-04-15,
#> 2021-07-24, 2021-08-24, 2021-09-23, and 2021-09-30)
#> Versions adding epikey + time values but no bulk reporting: 400
#> Revision-only versions: 45
#>
#> ── Remaining information is for non-bulk-reported epikey + time values
#> with semi-stable versions past the waiting period available.
#>
#> ── Fraction of epi_key + time_values with
#> No revisions:
#> • 0 out of 1,840 (0%)
#> Quick revisions (last revision within 3 days of the `time_value`):
#> • 0 out of 1,840 (0%)
#> Few revisions (At most 3 revisions for that `time_value`):
#> • 0 out of 1,840 (0%)
#>
#> ── Fraction of revised epi_key + time_values which have:
#> Less than 0.1 spread in relative value:
#> • 66 out of 1,840 (3.59%)
#> Spread of more than 2.221 in actual value (when revised):
#> • 658 out of 1,840 (35.76%)
#>
#> ── Days until within 20% of the latest value:
#> min median mean max
#> 3 days 5 days 8.9 days 67 days
#>
#> ── Days until at the latest lag:
#> min median mean max
#> 58 days 73 days 70.7 days 74 days
# Print some underlying data:
library(dplyr) # for arrange() and desc(), in case they are not already attached
revision_example <- revision_analysis(archive_cases_dv_subset, percent_cli)
revision_example$revision_behavior %>% arrange(desc(spread))
#> # A tibble: 1,840 × 11
#> time_value geo_value n_revisions min_lag max_lag lag_near_latest spread
#> <date> <chr> <int> <drtn> <drtn> <drtn> <dbl>
#> 1 2020-12-26 ca 62 3 days 73 days 6 days 14.1
#> 2 2020-12-25 ca 62 3 days 73 days 7 days 13.2
#> 3 2020-11-27 fl 66 3 days 73 days 4 days 12.0
#> 4 2021-09-27 fl 43 3 days 63 days 59 days 9.79
#> 5 2020-12-25 fl 62 3 days 73 days 4 days 9.75
#> 6 2021-09-26 fl 43 4 days 64 days 60 days 9.48
#> 7 2021-09-27 ca 43 3 days 63 days 8 days 9.31
#> 8 2021-09-25 fl 43 5 days 65 days 61 days 8.83
#> 9 2020-11-05 ny 66 3 days 73 days 11 days 8.64
#> 10 2020-11-27 tx 66 3 days 73 days 10 days 8.56
#> # ℹ 1,830 more rows
#> # ℹ 4 more variables: rel_spread <dbl>, min_value <dbl>, max_value <dbl>,
#> # median_value <dbl>
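The fractions in the printed report can be approximated from a revision_behavior-style table. The sketch below uses a small hypothetical data frame rather than real archive output. The comparisons (<= for quick and few revisions, < for relative spread, restricted to revised rows) are assumptions read off the report wording, not the print method's actual code.

```r
# Hypothetical stand-in for `revision_behavior`; values are illustrative only.
behavior <- data.frame(
  n_revisions = c(0, 2, 5, 10),
  max_lag = as.difftime(c(1, 3, 20, 70), units = "days"),
  spread = c(0, 0.5, 4, 12),
  rel_spread = c(NA, 0.05, 0.4, 0.6)
)

# Thresholds matching the print() defaults listed in Usage.
quick_revision <- as.difftime(3, units = "days")
few_revisions <- 3
rel_spread_threshold <- 0.1

# Share of entries with no revisions, quick revisions, and few revisions.
frac_no_revisions <- mean(behavior$n_revisions == 0)
frac_quick <- mean(behavior$max_lag <= quick_revision)
frac_few <- mean(behavior$n_revisions <= few_revisions)

# Among revised entries, share with a small relative spread.
revised <- behavior[behavior$n_revisions > 0, ]
frac_small_rel_spread <- mean(revised$rel_spread < rel_spread_threshold)
```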