Caret vs Tidymodels: How to Use Both Packages for Machine Learning?

An example of building models with two popular packages together in R to predict bike sharing demand

Max Kuhn is the author of both packages (with contributions from many other talented people). The caret package (short for Classification And REgression Training) streamlines the process of creating predictive models and has long been a top choice among R users. It has been around for a long time, so there are numerous resources, answers, and solutions to almost any question you might have. On the other hand,  tidymodels  is newer and is built on  tidyverse  principles. RStudio hired Max with the intention of designing a tidy version of caret.

I have been using  caret  for predictive modelling. While I was aware of  tidymodels , I only began exploring it last week. As with everything in life, adopting a new ecosystem takes time and patience, so this post is by no means an exhaustive analysis. The complete code is available on GitHub, and an HTML version of the Markdown is published. The layout is optimised for desktop viewing, particularly the code chunks.

Overview

caret  is a single package with various functions for machine learning: for example,  createDataPartition  for splitting data and  trainControl  for setting up cross-validation.

tidymodels  is a collection of packages for modelling. When I execute the  library(tidymodels)  command, the following packages are loaded:

  • rsample: for data splitting and resampling
  • parsnip: for trying out a range of models
  • recipes: for pre-processing
  • workflows: for putting everything together
  • yardstick: for evaluating models
  • broom: for converting the information in common statistical R objects into user-friendly, predictable formats
  • dials: for creating and managing tuning parameters

Some common  tidyverse  libraries, such as  dplyr , are also loaded. As shown,  tidymodels  breaks the machine learning workflow down into multiple stages and provides a specialised package for each one. This gives users more flexibility and more possibilities, but it can be intimidating for a beginner (at least it was for me).

Import data

The data come from the Bike Sharing Dataset in the UCI Machine Learning Repository. The goal is to predict the bike rental count  total  based on environmental and seasonal settings.

library(tidymodels) 
library(caret)
library(lubridate)
library(tidyverse)
library(moments)
library(corrr)
library(randomForest)
bike <- read_csv("Bike-Sharing-Dataset/hour.csv")
bike %>% dim()
## [1] 17379 17

There are 17,379 cases and 17 features. I removed  instant , changed the formatting for  year , and renamed some variables.

bike %>%
  mutate(instant = NULL, yr = yr + 2011) %>%
  rename(
    date = dteday,
    year = yr,
    month = mnth,
    hour = hr,
    weather = weathersit,
    humidity = hum,
    total = cnt
  ) -> bike
head(bike)
[Figure: Data frame header preview]

Explore data

Target variable

bike %>%
  pivot_longer(
    cols = c(casual, registered, total),
    names_to = "usertype",
    values_to = "count"
  ) %>%
  ggplot(aes(count, colour = usertype)) +
  geom_density() +
  labs(
    title = "Distribution of the number of rental bikes",
    x = "Number per hour", y = "Density"
  ) +
  scale_colour_discrete(
    name = "User type",
    breaks = c("casual", "registered", "total"),
    labels = c("Non-registered", "Registered", "Total")
  )
[Figure: Target variable distribution]

The distributions of the rental counts are positively skewed. A roughly normal distribution is desirable, since many machine learning techniques assume, or at least perform better with, a normally distributed dependent variable. I address the skewness later.

Correlation

I used the  correlate()  function from the  corrr  package, which is part of  tidymodels  but not automatically loaded. I am constantly surprised by how many packages there are in the tidy ecosystem. Typically, I prioritise tidy packages over independent ones because of the pipe-friendly integration and the consistent aesthetics, and  corrr  is no exception.

bike %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange(absolute = FALSE) %>%
  shave() -> bike_cor
rplot(bike_cor, print_cor = TRUE)
[Figure: Correlation plot]

Prepare data

Since I have not split the data yet, this step does not include scaling or centring, which should be fitted on the training set and then applied to the testing set. Here, I focus on processing that applies to all of the data and does not involve any estimated parameters, such as factorising or simple mathematical transformations. For example, if I take the square root of a number, I can square the result to recover the original value. For normalisation, however, I need to know the minimum and maximum of a variable, both of which may differ between the training and testing sets.
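
To make the distinction concrete, here is a tiny sketch with made-up numbers (not the bike data): a parameter-free transformation can be applied before splitting, whereas a parameterised one such as min-max normalisation has to be estimated on the training set only.

# Hypothetical values purely for illustration
x_train <- c(2, 5, 9)
x_test <- c(4, 12)

# Parameter-free: the square root needs no information from the training data
sqrt(x_test)

# Parameterised: the minimum and maximum are estimated on the training set...
rng <- range(x_train)

# ...and then reused to scale the test set (values can fall outside [0, 1])
(x_test - rng[1]) / (rng[2] - rng[1])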

Target variable

I focused on the  total  count, so the  casual  and  registered  variables are removed. As noted earlier, the target variable is positively skewed and requires transformation. I tried several common techniques for positively skewed data and applied the one whose result had the skewness closest to zero: the cube root.

bike_all <- bike %>%
select(-casual, -registered)

# Original
skewness(bike_all$total)
## [1] 1.277301

# Log
skewness(log10(bike_all$total))
## [1] -0.936101

# Log + constant
skewness(log1p(bike_all$total))
## [1] -0.8181098

# Square root
skewness(sqrt(bike_all$total))
## [1] 0.2864499

# Cube root
skewness(bike_all$total^(1 / 3))
## [1] -0.0831688

# Transform with cube root
bike_all$total <- bike_all$total^(1 / 3)

Predictors

Categorical variables are converted to factors according to the attribute information provided by UCI.

bike_all$season <- factor(
  bike_all$season,
  levels = c(1, 2, 3, 4),
  labels = c("spring", "summer", "autumn", "winter")
)
bike_all$holiday <- factor(
  bike_all$holiday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$workingday <- factor(
  bike_all$workingday,
  levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$weather <- factor(
  bike_all$weather,
  levels = c(1, 2, 3, 4),
  labels = c("clear", "cloudy", "rainy", "heavy rain"),
  ordered = TRUE
)
head(bike_all)
[Figure: Data frame header preview]

Split data (Train/Test, Cross-Validation)

Both packages provide functions for common data-splitting strategies, such as k-fold, grouped k-fold, leave-one-out, and bootstrapping. But tidymodels appears to be more comprehensive, since it also offers Monte Carlo cross-validation (I don’t know what this is, but it sounds cool) and nested cross-validation. I paid particular attention to the splitting method because a research paper found that “nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size” (Vabalas et al., 2019).
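
For reference, both of those schemes live in  rsample ; the snippet below is just a quick sketch of what the calls look like (I do not actually use these resamples later in the post).

# Monte Carlo cross-validation: 25 random 80/20 splits
mc_folds <- mc_cv(bike_all, prop = 0.8, times = 25)

# Nested cross-validation: outer folds estimate performance, inner folds tune
nested_folds <- nested_cv(
  bike_all,
  outside = vfold_cv(v = 5),
  inside = vfold_cv(v = 3)
)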

tidymodels

The tidymodels  rsample  library handles data splitting. The training/testing split is done as shown below, along with 10-fold cross-validation.

set.seed(25)
split <- initial_split(bike_all, prop = 0.8)
train_data <- training(split)
train_data %>% dim()
## [1] 13904 14

test_data <- testing(split)
test_data %>% dim()
## [1] 3475 14

train_cv <- vfold_cv(train_data, v = 10)

caret

There are two options available:

  • Use native functions:
set.seed(25)
train_index <- createDataPartition(
  bike_all$total, p = 0.8, times = 1, list = FALSE
)
train_data <- bike_all[train_index, ]
test_data <- bike_all[-train_index, ]

fold_index <- createFolds(
  train_data$total,
  k = 10, returnTrain = TRUE, list = TRUE
)
train_cv <- trainControl(method = "cv", index = fold_index)
  • Use tidymodels’  rsample2caret  function, which returns a list that mimics the  index  and  indexOut  elements of a  trainControl  object.
train_cv_caret <- rsample2caret(train_cv)
ctrl_caret <- trainControl(
  method = "cv",
  index = train_cv_caret$index,
  indexOut = train_cv_caret$indexOut
)

The two packages are pretty similar. It’s interesting that  trainControl  only specifies the cross-validation strategy, not the data. Nonetheless, thanks to the  rsample2caret()  and  caret2rsample()  functions, it’s easy to set up resampling in whichever package you prefer. Here I used  rsample2caret()  to generate the 10-fold indices for  caret , ensuring that the cross-validation folds are identical for both packages.

Preprocess data

In  caret , a single function,  preProcess , covers all the pre-processing for numerical features, including imputation, centring, scaling, and power transformations. For categorical features, I can either use  dummyVars  to create dummy variables or leave them to  train , which handles factors during model training. One thing I find frustrating is that I cannot specify different pre-processing for different variables. For example, I might want to apply a Box-Cox transformation to normalise one extremely skewed variable, but  preProcess  performs the transformation on all predictors. If you’re familiar with  sklearn  in Python: what I want is  ColumnTransformer .
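
For illustration, here is a minimal sketch of the caret approach described above (the choice of methods is mine, not part of the original analysis):  preProcess()  learns its parameters from the training data and  predict()  applies them, while  dummyVars()  handles the factor encoding.

# Centre and scale every numeric column; preProcess() cannot be limited
# to a chosen subset of predictors, which is the frustration mentioned above
pp <- preProcess(train_data, method = c("center", "scale"))
train_scaled <- predict(pp, train_data)
test_scaled <- predict(pp, test_data)

# A full set of dummy columns for one factor variable
dv <- dummyVars(~ season, data = train_data)
head(predict(dv, train_data))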

tidymodels

recipes  delivers on that wish, allowing me to define a recipe, or blueprint, that sequentially specifies the encodings and pre-processing of the data (i.e. “feature engineering”). Moreover,  recipes  seems to offer more pre-processing methods. Unlike  caret , however,  recipes  does not automatically handle categorical variables, so I need to create the dummy variables myself. Still, the ability to tailor the pre-processing trumps the small inconvenience of having to generate dummy variables.

prep_recipe <-
  recipe(total ~ ., data = train_data) %>%
  step_rm(year, month, weekday) %>%
  step_date(date) %>%
  step_corr(all_numeric(), threshold = 0.8) %>%
  step_dummy(all_nominal())

caret

Similarly, I can use caret’s  preProcess()  function, but I always find it frustrating because every numerical variable is processed and there is not much flexibility.

# correlation filter at 0.8; preProcess() needs the data and a method
prep <- preProcess(train_data, method = "corr", cutoff = 0.8)

Again, tidymodels’ recipe can also be used with caret. Here, I  prep()  and  bake()  the recipe to transform the data, because caret does not have a workflow function.

train_data_caret <-
prep(prep_recipe) %>% bake(new_data = NULL)

test_data_caret <-
prep(prep_recipe) %>% bake(new_data = test_data)

Train models

Two custom functions I will use later:

# Generate prediction tables
predict_table <- function(model, data, tidy_flag) {
  if (tidy_flag == TRUE) {
    # tidymodels models return a tibble with a .pred column
    result <- model %>%
      predict(data) %>%
      rename(pred = .pred) %>%
      mutate(
        actual = data$total,
        pred_real = pred^3,      # undo the cube-root transformation
        actual_real = actual^3
      )
  } else {
    # caret models return a plain vector of predictions
    result <- model %>%
      predict(data) %>%
      as_tibble_col(column_name = "pred") %>%
      mutate(
        actual = data$total,
        pred_real = pred^3,
        actual_real = actual^3
      )
  }
  result
}

# Extract RMSE for models
pull_rmse <- function(result_table) {
  # RMSE on the cube-root scale and on the original count scale
  rmse_result <- rmse(result_table, actual, pred) %>%
    pull(.estimate)
  rmse_result_real <- rmse(result_table, actual_real, pred_real) %>%
    pull(.estimate)
  c(rmse = rmse_result, real_rmse = rmse_result_real)
}

Baseline

The baseline is simply the average of  total .

base_train_pred <-
  tibble(
    actual = train_data$total,
    actual_real = train_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))

base_test_pred <-
  tibble(
    actual = test_data$total,
    actual_real = test_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))

base_train_rmse <- pull_rmse(base_train_pred)
print(base_train_rmse)
##      rmse  real_rmse
##  2.032927 181.063306

base_test_rmse <- pull_rmse(base_test_pred)
print(base_test_rmse)
##     rmse real_rmse
##  2.02608 182.61370

Decision trees with tidymodels

I used  parsnip  for modelling,  workflows  for, well… the workflow,  tune  for parameter tuning, and  yardstick  for performance metrics. I was also curious about the speed, so I recorded the time as well.

# Cost complexity for decision tree parameter
tree_cp <- seq(0.01, 0.1, 0.01)

set.seed(25)
tree_tidy_time1 <- Sys.time()

# Specify model
tree_engine <-
  decision_tree(mode = "regression", cost_complexity = tune()) %>%
  set_engine("rpart")

# Set workflow (Preprocess & model)
tree_workflow <-
  workflow() %>%
  add_recipe(prep_recipe) %>%
  add_model(tree_engine)

# Tune parameters with cross-validation
tree_tune <- tune_grid(
  tree_workflow,
  resamples = train_cv,
  grid = data.frame(cost_complexity = tree_cp),
  metrics = metric_set(rmse)
)

# Fit again with the best parameter
tree_best <-
  finalize_workflow(tree_workflow, select_best(tree_tune)) %>%
  fit(train_data)

tree_tidy_time2 <- Sys.time()
print(tree_tidy_time2 - tree_tidy_time1)
## Time difference of 1.376683 mins

It takes around 1 minute and 20 seconds to cross-validate the ten parameter values. Once it’s done, I can predict the target variable and examine model performance with RMSE. Here I used the custom functions  predict_table  and  pull_rmse  to complete the task.

tree_tidy_train_pred <- predict_table(tree_best, train_data, TRUE)
tree_tidy_train_rmse <- pull_rmse(tree_tidy_train_pred)
print(tree_tidy_train_rmse)
##     rmse  real_rmse
## 1.078724 116.106006

tree_tidy_test_pred <- predict_table(tree_best, test_data, TRUE)
tree_tidy_test_rmse <- pull_rmse(tree_tidy_test_pred)
print(tree_tidy_test_rmse)
##     rmse  real_rmse
## 1.074347 118.205989

Decision trees with caret

set.seed(25)
tree_caret_time1 <- Sys.time()
tree_caret <- train(
  total ~ .,
  data = train_data_caret,
  method = "rpart",
  trControl = ctrl_caret,
  metric = "RMSE",
  tuneGrid = data.frame(cp = tree_cp)
)
tree_caret_time2 <- Sys.time()
print(tree_caret_time2 - tree_caret_time1)
## Time difference of 4.469931 secs

Wooooow! It only takes about 4.5 seconds, and the code is much shorter. The single  train  call includes the model ( method = "rpart" ), the cross-validation ( trControl = ctrl_caret ), and the parameter tuning ( tuneGrid = data.frame(cp = tree_cp) ).

tree_caret_train_pred <- predict_table(tree_caret, train_data_caret, FALSE)
tree_caret_train_rmse <- pull_rmse(tree_caret_train_pred)
print(tree_caret_train_rmse)
##     rmse  real_rmse
## 1.078724 116.106006

tree_caret_test_pred <- predict_table(tree_caret, test_data_caret, FALSE)
tree_caret_test_rmse <- pull_rmse(tree_caret_test_pred)
print(tree_caret_test_rmse)
##     rmse  real_rmse
## 1.074347 118.205989

Compare models

rbind(
  base_train_rmse, base_test_rmse,
  tree_tidy_train_rmse, tree_tidy_test_rmse,
  tree_caret_train_rmse, tree_caret_test_rmse
)
##                           rmse real_rmse
## base_train_rmse       2.032927  181.0633
## base_test_rmse        2.026080  182.6137
## tree_tidy_train_rmse  1.078724  116.1060
## tree_tidy_test_rmse   1.074347  118.2060
## tree_caret_train_rmse 1.078724  116.1060
## tree_caret_test_rmse  1.074347  118.2060

As you can see, the decision tree results are identical regardless of the package, since I split the data and set up the cross-validation folds in the same way, and both tidymodels and caret use  rpart  as the underlying engine. So it seems strange that tidymodels takes over a minute while caret only needs 4–5 seconds to run the decision tree.
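
One way to convince yourself that the two fits really are the same model (a sketch I added, not part of the original analysis) is to pull out the underlying  rpart  objects and the selected complexity parameter from each:

# The rpart model inside the fitted tidymodels workflow
# (on older workflows versions: pull_workflow_fit(tree_best)$fit)
extract_fit_engine(tree_best)

# The rpart model and the cp chosen by caret's cross-validation
tree_caret$finalModel
tree_caret$bestTune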

Conclusion

I have been using tidymodels for a few weeks now, and I really enjoy the integration with the tidyverse. But I find it confusing to have so many steps and objects; for example, I kept trying to get the RMSE out of a workflow object. I probably need a bit more time to get familiar with the new ecosystem.
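
In case it saves someone else the same head-scratching: as far as I understand the API, the cross-validated RMSE lives in the tuning results rather than in the workflow itself, and the  tune  helpers pull it out.

# Mean RMSE across folds for every candidate cost_complexity value
collect_metrics(tree_tune)

# The top candidates ranked by RMSE
show_best(tree_tune, metric = "rmse")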

The complete code is available on GitHub, and an HTML version of the Markdown is published. As always, I hope this post is helpful. Have a wonderful day!

Reference

Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PLoS ONE, 14(11), e0224365.