This chapter describes how to detect and correct outliers in signals using the detect_outlr() and correct_outlr() functions provided in the epiprocess package. These functions are designed to be modular and extendable, so that you can define your own outlier detection and correction routines and apply them to epi_df objects. We’ll demonstrate this using state-level daily reported COVID-19 case counts from FL and NJ.
There are multiple outliers in these data that a modeler may want to detect and correct. We’ll discuss those two tasks in turn.
6.1 Outlier detection
The detect_outlr() function allows us to run multiple outlier detection methods on a given signal, and then (optionally) combine the results from those methods. Here, we’ll investigate outlier detection results from the following methods.
Detection based on a rolling median, using detect_outlr_rm(), which computes a rolling median with a default window size of n time points centered at the time point under consideration, and then computes thresholds based on a multiplier times a rolling IQR computed on the residuals.
Detection based on a seasonal-trend decomposition using LOESS (STL), using detect_outlr_stl(), which is similar to the rolling median method but replaces the rolling median with fitted values from STL.
Detection based on an STL decomposition, but without the seasonality term, which amounts to smoothing using LOESS.
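To make the rolling median idea concrete, here is a minimal base-R sketch of that style of detection. It is not the implementation of detect_outlr_rm(): the helper name rolling_median_outliers(), its arguments, and the way windows are truncated at the ends of the series are all assumptions made for illustration.

```r
# Hypothetical helper (not part of epiprocess): flag points that fall outside
# fitted +/- multiplier * rolling IQR of the residuals.
rolling_median_outliers <- function(y, n, multiplier) {
  half <- floor(n / 2)
  idx <- seq_along(y)

  # Apply `stat` over a centered window, truncated at the series boundaries
  # (boundary handling is an assumption of this sketch).
  roll <- function(v, stat) {
    vapply(idx, function(i) {
      window <- v[max(1, i - half):min(length(v), i + half)]
      stat(window)
    }, numeric(1))
  }

  fitted <- roll(y, median)   # rolling median of the signal
  resid <- y - fitted         # residuals from the rolling median
  spread <- roll(resid, IQR)  # rolling IQR of the residuals

  lower <- fitted - multiplier * spread
  upper <- fitted + multiplier * spread

  data.frame(y, fitted, lower, upper, outlier = y < lower | y > upper)
}

# Toy series with one obvious spike at position 6
toy <- c(10, 11, 9, 10, 12, 60, 11, 10, 9, 11, 10)
res <- rolling_median_outliers(toy, n = 5, multiplier = 2)
which(res$outlier)
```

Swapping the rolling median for fitted values from another smoother, such as STL or LOESS, would give the general shape of the other two methods, with the thresholding step left unchanged.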
The outlier detection methods are specified using a tibble that is passed to detect_outlr(), with one row per method, and whose columns specify the outlier detection function, any input arguments (only nondefault values need to be supplied), and an abbreviated name for the method used in tracking results. Abbreviations “rm” and “stl” can be used for the built-in detection functions detect_outlr_rm() and detect_outlr_stl(), respectively.
#> # A tibble: 2 × 3
#>   method args             abbr
#>   <chr>  <list>           <chr>
#> 1 rm     <named list [2]> rm
#> 2 stl    <named list [3]> stl_seasonal
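A table with this shape could be assembled along the following lines. This is only a sketch in base R: the argument names and values inside each args list (detect_negatives, detection_multiplier, seasonal_period) are assumptions chosen to match the list lengths shown in the printed output, and should be checked against the package documentation before use.

```r
# One row per method; `args` is a list column holding only nondefault arguments.
# All argument names and values below are illustrative assumptions.
detection_methods_sketch <- data.frame(
  method = c("rm", "stl"),
  stringsAsFactors = FALSE
)
detection_methods_sketch$args <- list(
  list(detect_negatives = TRUE, detection_multiplier = 2.5),
  list(detect_negatives = TRUE, detection_multiplier = 2.5, seasonal_period = 7)
)
detection_methods_sketch$abbr <- c("rm", "stl_seasonal")

# Number of arguments supplied per method
lengths(detection_methods_sketch$args)
```

Since the text specifies a tibble, a data frame built this way would be converted with tibble::as_tibble() before being passed to detect_outlr().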
Additionally, we’ll form combined lower and upper thresholds, calculated as the median of the lower and upper thresholds from the methods at each time point. Note that using this combined median threshold is equivalent to using a majority vote across the base methods to determine whether a value is an outlier.
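The equivalence between the median threshold and a majority vote can be checked on a small example. The sketch below uses made-up thresholds from three hypothetical methods (m1, m2, m3) at three time points; it assumes an odd number of methods whose lower thresholds all sit below their upper thresholds.

```r
# Hypothetical thresholds: rows are time points, columns are base methods.
lower <- cbind(m1 = c(5, 6, 4), m2 = c(7, 5, 6), m3 = c(6, 8, 5))
upper <- cbind(m1 = c(15, 14, 16), m2 = c(13, 15, 12), m3 = c(14, 12, 13))
obs <- c(10, 14.5, 4.5)  # observed values at the three time points

# Combined thresholds: median across methods at each time point
combined_lower <- apply(lower, 1, median)
combined_upper <- apply(upper, 1, median)
flag_combined <- obs < combined_lower | obs > combined_upper

# Majority vote: a point is an outlier if more than half the methods flag it.
# `obs` is recycled down the columns, so obs[i] is compared with row i.
votes <- rowSums(obs < lower | obs > upper)
flag_majority <- votes > ncol(lower) / 2

data.frame(obs, combined_lower, combined_upper, flag_combined, votes, flag_majority)
```

In this toy case the two flags agree at every time point: the first value lies inside all three intervals, while the second and third each fall outside two of the three.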
x <- incidence_num_outlier_example %>%
  group_by(geo_value) %>%
  mutate(
    outlier_info = detect_outlr(
      x = time_value, y = cases,
      methods = detection_methods,
      combiner = "median"
    )
  ) %>%
  unpack(outlier_info) %>%
  ungroup()

x
Finally, in order to correct outliers, we can use the posited replacement values returned by each outlier detection method. Below we use the replacement value from the combined method, which is defined by the median of replacement values from the base methods at each time point.
y <- x %>%
  mutate(cases_corrected = combined_replacement) %>%
  select(geo_value, time_value, cases, cases_corrected)

y %>% filter(cases != cases_corrected)
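As a rough illustration of what the combined replacement amounts to, the base-R sketch below takes the row-wise median of per-method replacement values and then keeps only the rows where the corrected value differs from the original. The data and the column name third_method are invented for the example and do not come from the package output.

```r
# Hypothetical case counts with a spike at the third time point
cases <- c(100, 105, 900, 98, 102)

# Hypothetical replacement values proposed by three base methods;
# they differ from `cases` only where a method flagged an outlier.
replacements <- cbind(
  rm = c(100, 105, 130, 98, 102),
  stl_seasonal = c(100, 105, 120, 98, 102),
  third_method = c(100, 105, 125, 98, 102)
)

# Combined replacement: median across methods at each time point
combined_replacement <- apply(replacements, 1, median)

corrected <- data.frame(
  time = seq_along(cases),
  cases = cases,
  cases_corrected = combined_replacement
)

# Rows that were actually changed by the correction
changed <- corrected[corrected$cases != corrected$cases_corrected, ]
changed
```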