A function to describe revision behavior for an archive.
Source: R/revision_analysis.R
revision_summary removes all missing values (if requested), and then computes some basic statistics about the revision behavior of an archive, returning a tibble summarizing the revisions per time_value+epi_key features. If print_inform is true, it prints a concise summary. The columns returned are:
- n_revisions: the total number of revisions for that entry
- min_lag: the minimum time to any value (if drop_nas=FALSE, this includes NAs)
- max_lag: the amount of time until the final (new) version (same caveat for drop_nas=FALSE, though it is far less likely to matter)
- min_value: the minimum value across revisions
- max_value: the maximum value across revisions
- median_value: the median value across revisions
- spread: the difference between the smallest and largest values (this always excludes NA values)
- rel_spread: spread divided by the largest value (so for non-negative values it is at most 1). Note that the largest value need not be the final value. It will be NA whenever spread is 0.
- time_near_latest: the time taken for the revisions to settle to within within_latest (default 20%) of the final value and stay there. For example, consider the series (0, 20, 99, 150, 102, 100); then time_near_latest is 5, since even though 99 is within 20% of the final value, the series leaves that window again at 150.
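The worked series above can be reproduced with a few lines of base R. This is only a sketch of the definitions as described here, not the package's implementation: settle_index is a hypothetical helper, it assumes the versions are ordered oldest first with no NA values, and it returns a position in the series rather than a difftime.

```r
# Illustrative only: settle_index is a hypothetical helper, not part of epiprocess.
# `values` holds the successive versions of one observation, oldest first.
settle_index <- function(values, within_latest = 0.2) {
  final <- values[length(values)]
  # TRUE where a version lies within `within_latest` of the final value.
  near <- abs(values - final) <= within_latest * abs(final)
  # One past the last version outside the band (1 if none are outside).
  max(c(0, which(!near))) + 1
}

values <- c(0, 20, 99, 150, 102, 100)

spread <- max(values) - min(values)  # 150
rel_spread <- spread / max(values)   # 1, because the smallest value is 0
settled_at <- settle_index(values)   # 5: 99 is inside the band, but 150 leaves it
```

Searching for the last version outside the band, instead of the first one inside it, is what enforces the "and stay there" part of the definition.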
Usage
revision_summary(
  epi_arch,
  ...,
  drop_nas = TRUE,
  print_inform = TRUE,
  min_waiting_period = as.difftime(60, units = "days"),
  within_latest = 0.2,
  quick_revision = as.difftime(3, units = "days"),
  few_revisions = 3,
  abs_spread_threshold = NULL,
  rel_spread_threshold = 0.1,
  compactify_tol = .Machine$double.eps^0.5,
  should_compactify = TRUE
)
Arguments
- epi_arch
an epi_archive to be analyzed
- ...
<tidyselect>, used to choose the column to summarize. If empty, it chooses the first. Currently only implemented for one column at a time.
- drop_nas
bool, drop any NA values from the archive? After dropping NAs, compactify is run again to make sure there are no duplicate values from occasions when the signal is revised to NA and then back to its immediately preceding value.
- print_inform
bool, determines whether to print summary information, or only return the full summary tibble.
- min_waiting_period
difftime, integer or NULL. Sets a cutoff: any time_values not earlier than min_waiting_period before versions_end are removed. min_waiting_period should characterize the typical time during which revisions occur. The default of 60 days corresponds to the typical time until case counts reported in the context of insurance reach a final value. To avoid this filtering, set it to either NULL or 0.
- within_latest
double between 0 and 1. Determines the threshold used for the time_near_latest column.
- quick_revision
difftime or integer (an integer is treated as days). For the printed summary, the maximum amount of time between the time_value and the final revision for the revision to be considered quickly resolved. Default is 3 days.
- few_revisions
integer, for the printed summary, the upper bound on the number of revisions to consider "few". Default is 3.
- abs_spread_threshold
numeric, for the printed summary, the maximum spread used to characterize revisions which don't actually change very much. Default is 5% of the maximum value in the dataset, but this is the most unit-dependent of the values, and likely needs to be chosen appropriately for the scale of the dataset.
- rel_spread_threshold
float between 0 and 1, for the printed summary, the relative spread fraction used to characterize revisions which don't actually change very much. Default is 0.1, or 10% of the final value.
- compactify_tol
float, used if drop_nas=TRUE. Determines the threshold for when two floats are considered identical.
- should_compactify
bool. Compactify if TRUE.
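As a rough illustration of the min_waiting_period cutoff described above, the following base R sketch applies the same rule to a handful of toy dates. The dates and the variable names cutoff and kept are made up for illustration, and the package's own filtering code may differ in its details.

```r
# Illustrative only: toy dates, not taken from any real archive.
versions_end <- as.Date("2021-12-01")
min_waiting_period <- as.difftime(60, units = "days")
time_values <- as.Date(c("2021-09-01", "2021-10-01", "2021-10-02", "2021-11-15"))

# Keep only time_values strictly earlier than versions_end - min_waiting_period;
# anything on or after the cutoff has not had the full waiting period to settle.
cutoff <- versions_end - min_waiting_period
kept <- time_values[time_values < cutoff]
```

Here the cutoff falls on 2021-10-02, so only the first two dates are kept. Setting min_waiting_period to NULL or 0 in revision_summary skips this filtering.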
Examples
revision_example <- revision_summary(archive_cases_dv_subset, percent_cli)
#> Min lag (time to first version):
#>      min median     mean     max
#>   3 days 3 days 3.5 days 12 days
#> Fraction of epi_key+time_values with
#> No revisions:
#> • 0 out of 1,956 (0%)
#> Quick revisions (last revision within 3 days of the `time_value`):
#> • 0 out of 1,956 (0%)
#> Few revisions (At most 3 revisions for that `time_value`):
#> • 0 out of 1,956 (0%)
#>
#> Fraction of revised epi_key+time_values which have:
#> Less than 0.1 spread in relative value:
#> • 91 out of 1,956 (4.65%)
#> Spread of more than 2.22056495 in actual value (when revised):
#> • 671 out of 1,956 (34.3%)
#> days until within 20% of the latest value:
#>      min median     mean     max
#>   3 days 5 days 9.1 days 67 days
revision_example %>% arrange(desc(spread))
#> # A tibble: 1,956 × 11
#>    time_value geo_value n_revisions min_lag max_lag time_near_latest spread
#>    <date>     <chr>           <dbl> <drtn>  <drtn>  <drtn>            <dbl>
#>  1 2020-12-26 ca                 62 3 days  73 days 6 days            14.1
#>  2 2020-12-25 ca                 62 3 days  73 days 7 days            13.2
#>  3 2020-11-27 fl                 66 3 days  73 days 4 days            12.0
#>  4 2021-09-27 fl                 43 3 days  63 days 59 days            9.79
#>  5 2020-12-25 fl                 62 3 days  73 days 4 days             9.75
#>  6 2021-09-26 fl                 43 4 days  64 days 60 days            9.48
#>  7 2021-09-27 ca                 43 3 days  63 days 8 days             9.31
#>  8 2021-09-25 fl                 43 5 days  65 days 61 days            8.83
#>  9 2020-11-05 ny                 66 3 days  73 days 11 days            8.64
#> 10 2020-11-27 tx                 66 3 days  73 days 10 days            8.56
#> # ℹ 1,946 more rows
#> # ℹ 4 more variables: rel_spread <dbl>, min_value <dbl>, max_value <dbl>,
#> # median_value <dbl>