Detects outliers based on a distance from the rolling median specified in terms of multiples of the rolling interquartile range (IQR).

Usage

detect_outlr_rm(
  x = seq_along(y),
  y,
  n = 21,
  log_transform = FALSE,
  detect_negatives = FALSE,
  detection_multiplier = 2,
  min_radius = 0,
  replacement_multiplier = 0
)

Arguments

x

Design points corresponding to the signal values y. Default is seq_along(y) (that is, equally-spaced points from 1 to the length of y).

y

Signal values.

n

Number of time steps to use in the rolling window. Default is 21. The window is centrally aligned: when n is odd, it extends from (n-1)/2 time steps before each design point to (n-1)/2 time steps after; when n is even, it extends from n/2-1 time steps before to n/2 time steps after.
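The window extent described for n can be worked out directly. The following is a minimal sketch; `window_offsets` is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not a package function): number of time steps
# before and after each design point covered by a centered window of size n.
window_offsets <- function(n) {
  if (n %% 2 == 1) {
    # Odd n: symmetric window
    before <- (n - 1) / 2
    after <- (n - 1) / 2
  } else {
    # Even n: one more step after than before
    before <- n / 2 - 1
    after <- n / 2
  }
  c(before = before, after = after)
}

offsets_default <- window_offsets(21) # the default n
offsets_even <- window_offsets(4)
```

With the default n = 21 this gives 10 steps on each side; with n = 4 it gives 1 step before and 2 steps after.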

log_transform

Should a log transform be applied before running outlier detection? Default is FALSE. If TRUE, and zeros are present, then the log transform will be padded by 1.
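One reading of the padding rule is sketched below. It assumes that "padded by 1" means adding 1 to every value before taking the log whenever any zero is present; `log_with_padding` is a hypothetical name, and the package's internal handling may differ.

```r
# Hypothetical helper illustrating the padding rule as described above.
# Assumption: the offset of 1 is applied to all values when any zero is present.
log_with_padding <- function(y) {
  offset <- if (any(y == 0)) 1 else 0
  log(y + offset)
}

padded <- log_with_padding(c(0, 1, 9))   # zeros present: log(y + 1)
unpadded <- log_with_padding(c(1, 2, 10)) # no zeros: plain log(y)
```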

detect_negatives

Should negative values automatically count as outliers? Default is FALSE.

detection_multiplier

Value determining how far the outlier detection thresholds are from the rolling median. The thresholds are calculated as (rolling median) +/- (detection multiplier) * (rolling IQR). Default is 2.
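The threshold formula can be illustrated with base R. This is a sketch under stated assumptions, not the package's implementation: `roll_apply` is a hypothetical helper, windows are truncated at the ends of the series, and the IQR is taken over the raw values in each window.

```r
# Hypothetical helper: apply `f` to a centered window of `n` points
# around each index. Assumption: windows are truncated at the series ends.
roll_apply <- function(y, n, f) {
  before <- if (n %% 2 == 1) (n - 1) / 2 else n / 2 - 1
  after <- if (n %% 2 == 1) (n - 1) / 2 else n / 2
  vapply(seq_along(y), function(i) {
    f(y[max(1, i - before):min(length(y), i + after)])
  }, numeric(1))
}

y <- c(1, 2, 3, 4, 100, 6, 7, 8, 9)
detection_multiplier <- 2

roll_median <- roll_apply(y, n = 5, f = stats::median)
roll_iqr <- roll_apply(y, n = 5, f = stats::IQR)

# (rolling median) +/- (detection multiplier) * (rolling IQR)
lower <- roll_median - detection_multiplier * roll_iqr
upper <- roll_median + detection_multiplier * roll_iqr

# A point is flagged when it falls outside its thresholds
is_outlier <- y < lower | y > upper
```

In this toy series only the fifth value (100) falls outside its thresholds, which are 0 and 12 at that point.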

min_radius

Minimum distance between rolling median and threshold, on transformed scale. Default is 0.
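A sketch of how a minimum radius could interact with the detection multiplier is given below. It assumes the distance from the median to each threshold is the larger of min_radius and (detection multiplier) * (rolling IQR); the numbers are made up for illustration.

```r
# Illustrative values only (not package output)
roll_median <- c(10, 10, 10)
roll_iqr <- c(0, 1, 5)
detection_multiplier <- 2
min_radius <- 3

# Assumption: the threshold distance is never allowed to drop below min_radius
radius <- pmax(min_radius, detection_multiplier * roll_iqr)

lower <- roll_median - radius
upper <- roll_median + radius
```

Under this assumption, a window with zero IQR still gets thresholds 3 units away from the median instead of collapsing onto it.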

replacement_multiplier

Value determining how far the replacement values are from the rolling median. The replacement is the original value if it is within the detection thresholds, or otherwise it is rounded to the nearest (rolling median) +/- (replacement multiplier) * (rolling IQR). Default is 0.
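The replacement rule can be sketched as follows. `replace_outliers` is a hypothetical helper, and the inputs (median, IQR, thresholds) are made-up values assumed to have been computed already.

```r
# Hypothetical helper: keep values inside the thresholds, otherwise move
# them to (median) -/+ (replacement multiplier) * (IQR) on the side
# where they fell out.
replace_outliers <- function(y, med, iqr, lower, upper,
                             replacement_multiplier = 0) {
  ifelse(
    y < lower, med - replacement_multiplier * iqr,
    ifelse(y > upper, med + replacement_multiplier * iqr, y)
  )
}

y <- c(5, 100, -50)

# Default multiplier of 0: outliers are replaced by the rolling median
repl_default <- replace_outliers(y, med = 6, iqr = 3, lower = 0, upper = 12)

# Multiplier of 1: outliers are pulled to one IQR from the median
repl_one <- replace_outliers(y, med = 6, iqr = 3, lower = 0, upper = 12,
                             replacement_multiplier = 1)
```

With the default multiplier of 0, every flagged value is replaced by the rolling median itself.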

Value

A tibble with number of rows equal to length(y) and columns giving the outlier detection thresholds (lower and upper) and the replacement values (replacement).

Examples

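# Load the packages that provide the functions used below
library(epiprocess) # as_epi_df(), detect_outlr_rm()
library(dplyr)      # %>%, group_by(), mutate()
library(tidyr)      # unnest()

# Detect outliers based on a rolling median, separately for each geo_value
incidence_num_outlier_example %>%
  dplyr::select(geo_value, time_value, cases) %>%
  as_epi_df() %>%
  group_by(geo_value) %>%
  mutate(outlier_info = detect_outlr_rm(
    x = time_value, y = cases
  )) %>%
  # Expand the nested tibble into lower, upper, and replacement columns
  unnest(outlier_info)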
#> An `epi_df` object, 730 x 6 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-21 22:17:14.962335
#> 
#> # A tibble: 730 × 6
#> # Groups:   geo_value [2]
#>    geo_value time_value cases lower upper replacement
#>  * <chr>     <date>     <dbl> <dbl> <dbl>       <dbl>
#>  1 fl        2020-06-01   667  530  2010          667
#>  2 nj        2020-06-01   486  150.  840.         486
#>  3 fl        2020-06-02   617  582. 1992.         617
#>  4 nj        2020-06-02   658  210.  771.         658
#>  5 fl        2020-06-03  1317  635  1975         1317
#>  6 nj        2020-06-03   541  270   702          541
#>  7 fl        2020-06-04  1419  713  1909         1419
#>  8 nj        2020-06-04   478  174.  790.         478
#>  9 fl        2020-06-05  1305  553  2081         1305
#> 10 nj        2020-06-05   825  118.  838.         825
#> # ℹ 720 more rows