library(tidymodels)
library(broom)
library(performance)
12 Regression in Tidymodels
This vignette is a gentle introduction into performing simple and multiple linear regression using tidymodels
. Model fitting will be done using parsnip, which provides a unifying interface for model fitting and the resulting output. This means that parsnip provides a single interface with standardized argument names for each class of models so that you don’t have to directly deal with the different interfaces for different functions that aim to do the same thing (like linear regression). See here for a list of models that parsnip
currently supports.
12.1 Libraries
12.2 Simple linear regression
The key steps to perform linear regression in tidymodels
are to first specify the model type and then to specify the model form and the data to be used to construct it.
To illustrate, we shall look to penguins
dataset from the tidymodels
’ modeldata
package. This dataset contains measurements for 344 penguins from three islands in Palmer Archipelago, Antarctica, and includes information on their species, island home, size (flipper length, body mass, bill dimensions), and sex.
# Let's inspect the data
head(penguins)
#> # A tibble: 6 × 7
#> species island bill_length_mm bill_depth_mm flipper_length_mm
#> <fct> <fct> <dbl> <dbl> <int>
#> 1 Adelie Torgersen 39.1 18.7 181
#> 2 Adelie Torgersen 39.5 17.4 186
#> 3 Adelie Torgersen 40.3 18 195
#> 4 Adelie Torgersen NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193
#> 6 Adelie Torgersen 39.3 20.6 190
#> # ℹ 2 more variables: body_mass_g <int>, sex <fct>
One thing you may have spotted is that there’s missing data in this dataset in the fourth row. For simplicity, we will only work with the complete cases. This reduces the number of rows in our dataset to 333.
<- penguins %>%
penguins filter(complete.cases(.))
head(penguins)
#> # A tibble: 6 × 7
#> species island bill_length_mm bill_depth_mm flipper_length_mm
#> <fct> <fct> <dbl> <dbl> <int>
#> 1 Adelie Torgersen 39.1 18.7 181
#> 2 Adelie Torgersen 39.5 17.4 186
#> 3 Adelie Torgersen 40.3 18 195
#> 4 Adelie Torgersen 36.7 19.3 193
#> 5 Adelie Torgersen 39.3 20.6 190
#> 6 Adelie Torgersen 38.9 17.8 181
#> # ℹ 2 more variables: body_mass_g <int>, sex <fct>
Much better! We will now build a simple linear regression model to model bill length as a function of bill depth.
In parsnip
, the model specification is broken down into small functions such as set_mode()
and set_engine()
to make the interface more flexible and readable. The general structure is to first specify a mode (regression or classification) and then an engine to indicate what software (or implementation of the algorithm) will be used to fit the model. For our purposes, the mode is regression
and the engine is lm
for ordinary least squares. You may note that setting the mode is unnecessary for linear regression, but we include it here as it is a good practice.
<- linear_reg() %>%
lm_spec set_mode("regression") %>%
set_engine("lm")
The above specification does not actually carry out the regression, rather it just states what we would like to do.
lm_spec
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
Once we have such a blueprint, we may fit a model by inputting data and a formula. Recall that in R, a formula takes the form y ~ x
where y
ix the response and x
is the predictor variable. For our example, where the response of bill length and predictor of bill depth, we would write the formula as bill_length_mm ~ bill_depth_mm
.
Unlike with standard R formula()
objects, the names used this a formula must be identical to the variable names in the dataset. No processing functions are allowed (processing is handled by the recipe()
).
<- lm_spec %>%
lm_fit fit(bill_length_mm ~ bill_depth_mm, data = penguins)
lm_fit
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm
#> 54.8909 -0.6349
The resulting parsnip
object includes basic information about the fit such as the model coefficients. To access the underlying fit object, we could use the standard lm_fit$fit
or with purrr
’s pluck()
function.
%>%
lm_fit pluck("fit")
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm
#> 54.8909 -0.6349
To get additional information about the fit (such as standard errors, and goodness-of-fit statistics), we can get a summary of the model fit as follows:
%>%
lm_fit pluck("fit") %>%
summary()
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ bill_depth_mm, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -12.9498 -3.9530 -0.3657 3.7327 15.5025
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 54.8909 2.5673 21.380 < 2e-16 ***
#> bill_depth_mm -0.6349 0.1486 -4.273 2.53e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5.332 on 331 degrees of freedom
#> Multiple R-squared: 0.05227, Adjusted R-squared: 0.04941
#> F-statistic: 18.26 on 1 and 331 DF, p-value: 2.528e-05
To get a tidy summary of the model parameter estimates, simply use the tidy function from the broom package on the model fit. To extract model statistics, glance()
can be used.
tidy(lm_fit)
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 54.9 2.57 21.4 2.54e-64
#> 2 bill_depth_mm -0.635 0.149 -4.27 2.53e- 5
glance(lm_fit)
#> # A tibble: 1 × 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0523 0.0494 5.33 18.3 0.0000253 1 -1029. 2064. 2075.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Now, to make predictions, we simply use predict()
on the parnsip model object. In there, we must specify the dataset we want to predict on in the new_data
argument. Note that this may be a different dataset than we used for fitting the model, but this input data must include all predictor variables that were used to fit the model.
predict(lm_fit, new_data = penguins)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 43.0
#> 2 43.8
#> 3 43.5
#> 4 42.6
#> 5 41.8
#> 6 43.6
#> # ℹ 327 more rows
For parnsip models, the predictions are always outputted in a tibble.
To specify the type of prediction made, modify type
argument. If we set type = "conf_int"
, we get a 95% confidence interval.
predict(lm_fit, new_data = penguins, type = "conf_int")
#> # A tibble: 333 × 2
#> .pred_lower .pred_upper
#> <dbl> <dbl>
#> 1 42.3 43.7
#> 2 43.3 44.4
#> 3 42.8 44.1
#> 4 41.8 43.5
#> 5 40.7 43.0
#> 6 43.0 44.2
#> # ℹ 327 more rows
To evaluate model predictive performance, it is logical to compare the each of the observed and predicted values. To see these values side-by-side we simply bind the two vectors of interest.
bind_cols(
predict(lm_fit, new_data = penguins),
penguins%>%
) select(bill_length_mm, .pred)
#> # A tibble: 333 × 2
#> bill_length_mm .pred
#> <dbl> <dbl>
#> 1 39.1 43.0
#> 2 39.5 43.8
#> 3 40.3 43.5
#> 4 36.7 42.6
#> 5 39.3 41.8
#> 6 38.9 43.6
#> # ℹ 327 more rows
A simpler way to do this is to use the nifty augment()
function.
augment(lm_fit, new_data = penguins) %>%
select(bill_length_mm, .pred)
#> # A tibble: 333 × 2
#> bill_length_mm .pred
#> <dbl> <dbl>
#> 1 39.1 43.0
#> 2 39.5 43.8
#> 3 40.3 43.5
#> 4 36.7 42.6
#> 5 39.3 41.8
#> 6 38.9 43.6
#> # ℹ 327 more rows
12.3 Multiple linear regression
The only difference about fitting a multiple linear regression model in comparison to a simple linear regression model lies the formula. For multiple linear regression, the predictors are specified in the formula expression, separated by +
. For example, if we have a response variable y
and three predictors, x1, x2,
and x3
, we would write the formula as, y ~ x1 + x2 + x3
.
<- lm_spec %>% fit(
lm_fit2 formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm + body_mass_g,
data = penguins
) lm_fit2
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm +
#> body_mass_g, data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm flipper_length_mm body_mass_g
#> -2.571e+01 6.131e-01 2.872e-01 3.472e-04
Everything else proceeds much the same as before. Such as obtaining parameter estimates
tidy(lm_fit2)
#> # A tibble: 4 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -25.7 6.72 -3.83 1.55e- 4
#> 2 bill_depth_mm 0.613 0.138 4.43 1.26e- 5
#> 3 flipper_length_mm 0.287 0.0351 8.18 6.28e-15
#> 4 body_mass_g 0.000347 0.000566 0.614 5.40e- 1
as well as predicting new values.
predict(lm_fit2, new_data = penguins)
#> # A tibble: 333 × 1
#> .pred
#> <dbl>
#> 1 39.0
#> 2 39.7
#> 3 42.5
#> 4 42.8
#> 5 42.8
#> 6 38.4
#> # ℹ 327 more rows
If you would like to use all variables aside from your response as predictors, a shortcut is to use the formula form y ~ .
<- lm_spec %>% fit(bill_length_mm ~ ., data = penguins)
lm_fit3 lm_fit3
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) speciesChinstrap speciesGentoo islandDream
#> 15.343291 9.835502 6.117675 -0.503815
#> islandTorgersen bill_depth_mm flipper_length_mm body_mass_g
#> -0.127431 0.300670 0.069257 0.001081
#> sexmale
#> 2.047859
12.4 Checking model assumptions
After fitting a model, it is good to check whether the assumptions of linear regression are met. For this, we will use the performance
package, in particular the check_model()
function to produce several helpful plots we may use to check the assumptions for our first multiple linear regression model.
%>%
lm_fit2 extract_fit_engine() %>%
check_model()
Notice that on each plot it says what we should expect to see if the model assumption is met.
We shall now briefly walk you through what each plot means.
The first two plots help us to examine the linearity of the errors versus the fitted values. Ideally, we want this error to be relatively flat and horizontal. The third plot is for checking homogeneity of the variance, where we want the points to be roughly the same distance from the line as this indicates similar dispersion. The fourth plot helps us to see if there are high leverage points - points that have command or influence over the model fit. As a result, these can have a great effect on the model predictions. So the removal of such points or modifications to the model may be necessary to deal with them. The fifth plot helps us to discern collinearity, which is when predictors are highly correlated. Since independent variables should be independent, this can throw off simple regression models (in standard error of coefficient estimates and the estimates themselves, which would likely be sensitive to changes in the predictors that are included in the model). The last plot enables us to check the normality of residuals. If the distribution of the model error is non-normal, then that suggests a linear model may not be appropriate. For a QQ plot, we want the points to fall along a straight diagonal line.
For our example, we observe that there’s a pretty high correlation between body_mass_g
and flipper_length_mm
(not quite in the red-zone of 10 and above, but close enough for concern). That is indicative of multicollinearity between them. Intuitively, it makes sense for the body mass and flipper length variables - we’d expect that as once increases, so should the other.
We can take a closer look at the correlation by whipping up a correlation matrix by using base R’s cor()
function. Since for collinearity we’re only usually interested in the numerical predictors, we’ll only include the four numeric variables.
<- penguins %>%
penguins_corr select(body_mass_g, ends_with("_mm")) %>%
cor()
penguins_corr
#> body_mass_g bill_length_mm bill_depth_mm flipper_length_mm
#> body_mass_g 1.0000000 0.5894511 -0.4720157 0.8729789
#> bill_length_mm 0.5894511 1.0000000 -0.2286256 0.6530956
#> bill_depth_mm -0.4720157 -0.2286256 1.0000000 -0.5777917
#> flipper_length_mm 0.8729789 0.6530956 -0.5777917 1.0000000
Indeed body_mass_g
and flipper_length_mm
are highly positively correlated. To deal with this problem, we’ll re-fit the model without body_mass_g
.
<- lm_spec %>% fit(
lm_fit3 formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm,
data = penguins
) lm_fit3
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ bill_depth_mm + flipper_length_mm,
#> data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm flipper_length_mm
#> -27.9762 0.6200 0.3052
and then check again to see whether the assumptions are met.
%>%
lm_fit3 extract_fit_engine() %>%
check_model()
Overall, the plots look pretty good. For details on how to interpret each of these plots and more details about model assumptions please see here and here.
12.5 Interaction terms
In general, the syntax to add an interaction term to a formula is as follows:
x:y
denotes an interaction term betweenx
andy
.x*y
denotes the interaction betweenx
andy
as well asx
andy
; that is,x + y + x*y
.
It is important to note that this syntax is not compatible with all engines. Thus, we shall explain how to bypass this issue by adding an interaction term in a recipe later on. For now, let’s start simple by adding an interaction term between species
and bill_length_mm
, which allows for a species-specific slope.
<- lm_spec %>% fit(
lm_fit4 formula = bill_length_mm ~ species * bill_depth_mm,
data = penguins
) lm_fit4
#> parsnip model object
#>
#>
#> Call:
#> stats::lm(formula = bill_length_mm ~ species * bill_depth_mm,
#> data = data)
#>
#> Coefficients:
#> (Intercept) speciesChinstrap
#> 23.3668 -9.9389
#> speciesGentoo bill_depth_mm
#> -6.6966 0.8425
#> speciesChinstrap:bill_depth_mm speciesGentoo:bill_depth_mm
#> 1.0796 1.2178
Using recipes, the interaction term is specified by using step_interact()
. Then we construct a workflow object, where we add the linear regression model specification and recipe. Finally, we fit the model as we did for a parsnip
model. Note that the workflow object does not need the variables that were specified in the recipe to be specified again.
<- recipe(
rec_spec_interact formula = bill_length_mm ~ species + bill_depth_mm,
data = penguins
%>%
) step_interact(~ species:bill_depth_mm)
<- workflow() %>%
lm_wf_interact add_model(lm_spec) %>%
add_recipe(rec_spec_interact)
%>% fit(penguins) lm_wf_interact
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_interact()
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) speciesChinstrap
#> 23.3668 -9.9389
#> speciesGentoo bill_depth_mm
#> -6.6966 0.8425
#> speciesChinstrap_x_bill_depth_mm speciesGentoo_x_bill_depth_mm
#> 1.0796 1.2178
Notice the variable name for the interaction term is not the same as it is in base R (which is simply of the form x:y
). In step_interact()
, the default separator between the variable names is _x_
. You can change this default by specifying the sep
argument in the function.
To read more about formula syntax, see ?formula.
12.6 Non-linear transformations of the predictors
Similar to how we were able to add an interaction term using recipes, we can also perform a transformation as a pre-processing step. The function used for this is step_mutate()
(which acts like dplyr
’s mutate
).
Note that, in general, if you are specifying a recipe aim to keep as much of the pre-processing in your recipe specification as possible. This helps to ensure that the transformation will be applied to new data consistently.
<- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
rec_spec_pow2 step_mutate(bill_depth_mm2 = bill_depth_mm^2)
<- workflow() %>%
lm_wf_pow2 add_model(lm_spec) %>%
add_recipe(rec_spec_pow2)
%>% fit(penguins) lm_wf_pow2
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_mutate()
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm bill_depth_mm2
#> 95.2558 -5.4431 0.1413
There are many transformations already built into recipes such as step_log()
. So, for basic transformations, there’s often no need to make your own transformation from scratch. See here for a comprehensive list of the transformations that are offered in recipes.
<- recipe(bill_length_mm ~ bill_depth_mm, data = penguins) %>%
rec_spec_log step_log(bill_depth_mm) # transforms the var in-place, keeps it's name
<- workflow() %>%
lm_wf_log add_model(lm_spec) %>%
add_recipe(rec_spec_log)
%>% fit(penguins) lm_wf_log
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_log()
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) bill_depth_mm
#> 74.95 -10.91
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧
12.7 Attribution
This Chapter was largely adapted from Chapter 3 of ISLR tidymodels labs. Checking linear regression assumptions using the performance package is based on this article and this blog post on investigating model performance. The artwork used is by Allison Horst.Allison Horst.
🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧 🐧