6  Detect and correct outliers in signals

This chapter describes functionality for detecting and correcting outliers in signals in the detect_outlr() and correct_outlr() functions provided in the epiprocess package. These functions is designed to be modular and extendable, so that you can define your own outlier detection and correction routines and apply them to epi_df objects. We’ll demonstrate this using state-level daily reported COVID-19 case counts from FL and NJ.

incidence_num_outlier_example
#> An `epi_df` object, 730 x 3 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2024-08-22 19:40:18.860591
#> 
#> # A tibble: 730 × 3
#>   geo_value time_value cases
#> * <chr>     <date>     <dbl>
#> 1 fl        2020-06-01   667
#> 2 nj        2020-06-01   486
#> 3 fl        2020-06-02   617
#> 4 nj        2020-06-02   658
#> 5 fl        2020-06-03  1317
#> 6 nj        2020-06-03   541
#> # ℹ 724 more rows
Code
ggplot(incidence_num_outlier_example, aes(x = time_value, y = cases, color = geo_value)) +
  geom_line() +
  scale_color_manual(values = c(3, 6)) +
  geom_hline(yintercept = 0, linetype = 3) +
  facet_wrap(~geo_value, scales = "free_y", ncol = 1) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts")

There are multiple outliers in these data that a modeler may want to detect and correct. We’ll discuss those two tasks in turn.

6.1 Outlier detection

The detect_outlr() function allows us to run multiple outlier detection methods on a given signal, and then (optionally) combine the results from those methods. Here, we’ll investigate outlier detection results from the following methods.

  1. Detection based on a rolling median, using detect_outlr_rm(), which computes a rolling median on with a default window size of n time points centered at the time point under consideration, and then computes thresholds based on a multiplier times a rolling IQR computed on the residuals.
  2. Detection based on a seasonal-trend decomposition using LOESS (STL), using detect_outlr_stl(), which is similar to the rolling median method but replaces the rolling median with fitted values from STL.
  3. Detection based on an STL decomposition, but without seasonality term, which amounts to smoothing using LOESS.

The outlier detection methods are specified using a tibble that is passed to detect_outlr(), with one row per method, and whose columms specify the outlier detection function, any input arguments (only nondefault values need to be supplied), and an abbreviated name for the method used in tracking results. Abbreviations “rm” and “stl” can be used for the built-in detection functions detect_outlr_rm() and detect_outlr_stl(), respectively.

detection_methods <- bind_rows(
  tibble(
    method = "rm",
    args = list(list(
      detect_negatives = TRUE,
      detection_multiplier = 2.5
    )),
    abbr = "rm"
  ),
  tibble(
    method = "stl",
    args = list(list(
      detect_negatives = TRUE,
      detection_multiplier = 2.5,
      seasonal_period = 7
    )),
    abbr = "stl_seasonal"
  )
)

detection_methods
#> # A tibble: 2 × 3
#>   method args             abbr        
#>   <chr>  <list>           <chr>       
#> 1 rm     <named list [2]> rm          
#> 2 stl    <named list [3]> stl_seasonal

Additionally, we’ll form combined lower and upper thresholds, calculated as the median of the lower and upper thresholds from the methods at each time point. Note that using this combined median threshold is equivalent to using a majority vote across the base methods to determine whether a value is an outlier.

x <- incidence_num_outlier_example %>%
  group_by(geo_value) %>%
  mutate(
    outlier_info = detect_outlr(
      x = time_value, y = cases,
      methods = detection_methods,
      combiner = "median"
    )
  ) %>%
  unpack(outlier_info) %>%
  ungroup()

x
#> # A tibble: 730 × 14
#>   geo_value time_value cases rm_geo_value rm_lower rm_upper rm_replacement
#>   <chr>     <date>     <dbl>        <dbl>    <dbl>    <dbl>          <dbl>
#> 1 fl        2020-06-01   667            0    345      2195             667
#> 2 nj        2020-06-01   486            0     64.4     926.            486
#> 3 fl        2020-06-02   617            0    406.     2169.            617
#> 4 nj        2020-06-02   658            0    140.      841.            658
#> 5 fl        2020-06-03  1317            0    468.     2142.           1317
#> 6 nj        2020-06-03   541            0    216       756             541
#> # ℹ 724 more rows
#> # ℹ 7 more variables: stl_seasonal_geo_value <dbl>, …

To visualize the results, we define a convenience function for and call it on each state separately (hidden below the fold).

Code
# Plot outlier detection bands and/or points identified as outliers
plot_outlr <- function(
    x, signal, method_abbr, bands = TRUE, points = TRUE,
    facet_vars = vars(geo_value), nrow = NULL, ncol = NULL,
    scales = "fixed") {
  # Convert outlier detection results to long format
  signal <- rlang::enquo(signal)
  x_long <- x %>%
    pivot_longer(
      cols = starts_with(method_abbr),
      names_to = c("method", ".value"),
      names_pattern = "(.+)_(.+)"
    )

  # Start of plot with observed data
  p <- ggplot() +
    geom_line(data = x, mapping = aes(x = time_value, y = !!signal))

  # If requested, add bands
  if (bands) {
    p <- p + geom_ribbon(
      data = x_long,
      aes(
        x = time_value, ymin = lower, ymax = upper,
        color = method
      ), fill = NA
    )
  }

  # If requested, add points
  if (points) {
    x_detected <- x_long %>% filter((!!signal < lower) | (!!signal > upper))
    p <- p + geom_point(
      data = x_detected,
      aes(
        x = time_value, y = !!signal, color = method,
        shape = method
      )
    )
  }

  # If requested, add faceting
  if (!is.null(facet_vars)) {
    p <- p + facet_wrap(facet_vars, nrow = nrow, ncol = ncol, scales = scales)
  }

  return(p)
}

Now we produce plots for each state at a time, faceting by the detection method.

Code
method_abbr <- c(detection_methods$abbr, "combined")

plot_outlr(x %>% filter(geo_value == "fl"), cases, method_abbr,
  facet_vars = vars(method), scales = "free_y", ncol = 2
) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(
    x = "Date", y = "Reported COVID-19 counts", color = "Method",
    shape = "Method"
  ) +
  scale_color_brewer(palette = "Set1") +
  ggtitle("Florida") +
  theme(legend.position = "bottom")

Code
plot_outlr(x %>% filter(geo_value == "nj"), cases, method_abbr,
  facet_vars = vars(method), scales = "free_y", ncol = 2
) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(
    x = "Date", y = "Reported COVID-19 counts", color = "Method",
    shape = "Method"
  ) +
  scale_color_brewer(palette = "Set1") +
  ggtitle("New Jersey") +
  theme(legend.position = "bottom")

6.2 Outlier correction

Finally, in order to correct outliers, we can use the posited replacement values returned by each outlier detection method. Below we use the replacement value from the combined method, which is defined by the median of replacement values from the base methods at each time point.

y <- x %>%
  mutate(cases_corrected = combined_replacement) %>%
  select(geo_value, time_value, cases, cases_corrected)

y %>% filter(cases != cases_corrected)
#> # A tibble: 34 × 4
#>   geo_value time_value cases cases_corrected
#>   <chr>     <date>     <dbl>           <dbl>
#> 1 fl        2020-07-12 15300          10181 
#> 2 nj        2020-07-19    -8            249.
#> 3 nj        2020-07-31   748            405 
#> 4 fl        2020-08-12  8109           5803.
#> 5 nj        2020-08-13   694            381 
#> 6 nj        2020-08-14   619            381 
#> # ℹ 28 more rows
Code
y %>%
  pivot_longer(starts_with("cases")) %>%
  ggplot(aes(x = time_value)) +
  geom_line(aes(y = value, color = name, linetype = name)) +
  scale_color_brewer(palette = "Set1") +
  scale_linetype_manual(values = c(2, 1)) +
  geom_hline(yintercept = 0) +
  facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
  scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
  labs(x = "Date", y = "Reported COVID-19 counts") +
  theme(legend.position = "bottom", legend.title = element_blank())

More advanced correction functionality will be coming at some point in the future.