Max Kuhn builds both packages (with contributions from many other talented people). The caret package (short for Classification And REgression Training) streamlines the process of creating predictive models and has long been a top choice among R users. It has been around for a long time, and there are numerous resources, answers, and solutions for almost any question you might have. On the other hand, tidymodels is newer and is built on tidyverse principles. RStudio hired Max with the intention of designing a tidy version of caret.
I have been using caret for predictive modelling. While I am aware of tidymodels, I only began exploring it last week. As with everything in life, adopting a new ecosystem takes time and patience, so this post is by no means an exhaustive analysis. The complete code is available on GitHub, and an HTML version of the Markdown is published. The layout is optimised for desktop viewing, particularly the code chunks.
Overview
caret is a single package with various functions for machine learning: for example, createDataPartition for splitting data and trainControl for setting up cross-validation.
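As a minimal sketch of those two helpers, using a hypothetical data frame df with outcome y (the real calls on the bike data appear later in this post):

# Hypothetical example: split df 80/20 and set up 10-fold cross-validation
library(caret)
idx <- createDataPartition(df$y, p = 0.8, list = FALSE)
ctrl <- trainControl(method = "cv", number = 10)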
tidymodels is a collection of packages for modelling. When I execute the library(tidymodels) command, the following packages are loaded:
- rsample: for data splitting and resampling
- parsnip: for trying out a range of models
- recipes: for pre-processing
- workflows: for putting everything together
- yardstick: for evaluating models
- broom: for converting the information in common statistical R objects into user-friendly, predictable formats
- dials: for creating and managing tuning parameters
Some common libraries from the tidyverse, such as dplyr, are also loaded. As shown, tidymodels breaks down the machine learning workflow into multiple stages and provides a specialised package for each stage. This gives users greater flexibility and more possibilities. However, for a beginner, it can be intimidating (at least it was for me).
Import data
The data come from the Bike Sharing Dataset in the UCI Machine Learning Repository. The goal is to predict the total bike rental count (total) based on environmental and seasonal settings.
library(tidymodels)
library(caret)
library(lubridate)
library(tidyverse)
library(moments)
library(corrr)
library(randomForest)
bike <- read_csv("Bike-Sharing-Dataset/hour.csv")
bike %>% dim()
## [1] 17379 17
There are 17,379 cases and 17 features. I removed instant, changed the formatting of year, and renamed some variables.
bike %>%
mutate(instant = NULL, yr = yr + 2011) %>%
rename(
date = dteday,
year = yr,
month = mnth,
hour = hr,
weather = weathersit,
humidity = hum,
total = cnt
) ->
bike
head(bike)
Explore data
Target variable
bike %>%
pivot_longer(
cols = c(casual, registered, total),
names_to = "usertype",
values_to = "count"
) %>%
ggplot(aes(count, colour = usertype)) +
geom_density() +
labs(
title = "Distribution of the number of rental bikes",
x = "Number per hour", y = "Density"
) +
scale_colour_discrete(
name = "User type",
breaks = c("casual", "registered", "total"),
labels = c("Non-registered", "Registered", "Total")
)
The distributions of rental counts are positively skewed. A roughly normal distribution is desirable, as many modelling techniques perform better when the dependent variable is not heavily skewed. I address the skewness later.
Correlation
I used the correlate() function from the corrr package, which is part of tidymodels but not automatically loaded. I am constantly surprised by how many packages there are in the tidy ecosystem. Typically, I prioritise tidy packages over independent ones because of the pipe-friendly workflow and the consistent aesthetics, and corrr is no exception.
bike %>%
select(where(is.numeric)) %>%
correlate() %>%
rearrange(absolute = FALSE) %>%
shave() ->
bike_cor
rplot(bike_cor, print_cor = TRUE)
Prepare data
Since I have not split the data yet, this step does not include scaling or centring, which should be fitted on the training set and then applied to the testing set. Here, I focus on processing that applies to all of the data and has no learned parameters, such as factorising or simple mathematical calculations. For example, if I take the square root of a number, I can square the result to recover the original value. For normalisation, however, I need to know the minimum and maximum values of a variable, both of which might differ between training and testing sets.
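As a quick toy illustration of the difference (my own example, not part of the original analysis), the square root needs no stored parameters, whereas min-max scaling relies on parameters learned from the training split:

# The square root needs nothing beyond the value itself
x_train <- c(2, 5, 9)
x_test <- c(1, 7, 12)
sqrt(x_train)

# Min-max scaling needs the min and max; learn them from the training split
# only and reuse them on the test split to avoid leakage
(x_test - min(x_train)) / (max(x_train) - min(x_train))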
Target variable
I focused on the total count, so the casual and registered variables are removed. As suggested earlier, the target variable is positively skewed and requires transformation. I tried several common techniques for positively skewed data and applied the one with the lowest skewness: the cube root.
bike_all <- bike %>%
select(-casual, -registered)
# Original
skewness(bike_all$total)
## [1] 1.277301
# Log
skewness(log10(bike_all$total))
## [1] -0.936101
# Log + constant
skewness(log1p(bike_all$total))
## [1] -0.8181098
# Square root
skewness(sqrt(bike_all$total))
## [1] 0.2864499
# Cube root
skewness(bike_all$total^(1 / 3))
## [1] -0.0831688
# Transform with cube root
bike_all$total <- bike_all$total^(1 / 3)
Predictors
Categorical variables are converted to factors according to the attribute information provided by UCI.
bike_all$season <- factor(
bike_all$season,
levels = c(1, 2, 3, 4),
labels = c("spring", "summer", "autumn", "winter")
)
bike_all$holiday <- factor(
bike_all$holiday,
levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$workingday <- factor(
bike_all$workingday,
levels = c(0, 1), labels = c(FALSE, TRUE)
)
bike_all$weather <- factor(
bike_all$weather,
levels = c(1, 2, 3, 4),
labels = c("clear", "cloudy", "rainy", "heavy rain"),
ordered = TRUE
)
head(bike_all)
Split data (Train/Test, Cross-Validation)
Both packages provide functions for common data splitting strategies, such as k-fold, grouped k-fold, leave-one-out, and bootstrapping. But tidymodels appears to be more comprehensive, since it also includes Monte Carlo cross-validation (I don't know what this is, but it sounds cool) and nested cross-validation. I paid particular attention to the splitting method because a research paper found that "nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size" (Vabalas et al., 2019).
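As a rough sketch (my own illustration, not used in the rest of this post), the two rsample helpers for those strategies look like this:

# Monte Carlo CV: repeated random 80/20 splits, here 25 of them
mc_splits <- mc_cv(bike_all, prop = 0.8, times = 25)

# Nested CV: 10-fold inner resampling wrapped inside 5-fold outer resampling
nested_splits <- nested_cv(
  bike_all,
  outside = vfold_cv(v = 5),
  inside = vfold_cv(v = 10)
)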
tidymodels
The tidymodels rsample library handles data splitting. The training and testing split is done as shown below, along with 10-fold cross-validation.
set.seed(25)
split <- initial_split(bike_all, prop = 0.8)
train_data <- training(split)
train_data %>% dim()
## [1] 13904 14
test_data <- testing(split)
test_data %>% dim()
## [1] 3475 14
train_cv <- vfold_cv(train_data, v = 10)
caret
There are two options available:
- Use native functions:
set.seed(25)
train_index <- createDataPartition(
bike_all$total, p = 0.8, times = 1, list = FALSE
)
train_data <- bike_all[train_index, ]
test_data <- bike_all[-train_index, ]
fold_index <- createFolds(
train_data$total,
k = 10, returnTrain = TRUE, list = TRUE
)
train_cv <- trainControl(method = "cv", index = fold_index)
- Use tidymodels' rsample2caret function, which returns a list that mimics the index and indexOut elements of a trainControl object.
train_cv_caret <- rsample2caret(train_cv)
ctrl_caret <- trainControl(
method = "cv",
index = train_cv_caret$index,
indexOut = train_cv_caret$indexOut
)
The two packages are pretty similar. It's interesting that trainControl only specifies the cross-validation strategy but not the data. Nonetheless, thanks to the rsample2caret() and caret2rsample() functions, it's easy to set up resampling in whichever package you prefer. Here I used rsample2caret() to generate the 10-fold indices for caret, ensuring that the cross-validation folds are identical in both packages.
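Going the other direction should be just as easy; here is a hedged sketch (not used in the rest of this post) of converting the caret control object back into an rsample object:

# Convert the caret trainControl indices back into an rsample rset
train_cv_tidy <- caret2rsample(ctrl_caret, data = train_data)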
Preprocess data
In caret, one function, preProcess, covers all pre-processing for numerical features, including imputation, centring, scaling, and power transformations. For categorical features, I can either use dummyVars to create dummy variables or leave it to train, which handles factors during model training. One thing I find frustrating is that I cannot specify different pre-processing for different variables. For example, I may want to use the Box-Cox transformation to normalise one extremely skewed variable, but preProcess performs the transformation on all predictors. If you are familiar with sklearn in Python, what I am saying is that I want ColumnTransformer.
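For illustration, a minimal caret sketch of what I mean (my own example, not part of the original pipeline):

# preProcess applies the chosen methods to every numeric predictor at once;
# there is no per-column control (adding "BoxCox" would transform them all)
prep_caret <- preProcess(select(train_data, -total), method = c("center", "scale"))
train_scaled <- predict(prep_caret, train_data)

# dummyVars handles the categorical predictors separately
dummies <- dummyVars(total ~ ., data = train_data)
train_dummy <- predict(dummies, newdata = train_data)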
tidymodels
recipes delivers that wish, allowing me to define a recipe or blueprint that sequentially defines the encodings and preprocessing of the data (i.e. "feature engineering"). Moreover, recipes seems to offer more preprocessing methods. However, unlike caret, recipes does not automatically handle categorical variables, and I need to create dummy variables manually. Still, the ability to tailor the preprocessing trumps the small inconvenience of having to generate dummy variables.
prep_recipe <-
recipe(total ~ ., data = train_data) %>%
step_rm(year, month, weekday) %>%
step_date(date) %>%
step_corr(all_numeric(), threshold = 0.8) %>%
step_dummy(all_nominal())
caret
Similarly, I can use caret's preProcess() function, but I always find it frustrating because all numerical variables are processed and there is not much flexibility.
# Drop predictors whose pairwise correlation exceeds 0.8
prep <- preProcess(select(train_data, -total), method = "corr", cutoff = 0.8)
Again, tidymodels' recipe can be used for caret. Here, I prep() and bake() to transform the data because caret does not have a workflow function.
train_data_caret <-
prep(prep_recipe) %>% bake(new_data = NULL)
test_data_caret <-
prep(prep_recipe) %>% bake(new_data = test_data)
Train models
Two custom functions I will use later:
# Generate prediction tables
predict_table <- function(model, data, tidy_flag) {
if (tidy_flag == TRUE) {
result <- model %>%
predict(data) %>%
rename(pred = .pred) %>%
mutate(
actual = data$total,
pred_real = pred^3,
actual_real = actual^3
)
} else {
result <- model %>%
predict(data) %>%
as_tibble_col(column_name = "pred") %>%
mutate(
actual = data$total,
pred_real = pred^3,
actual_real = actual^3
)
}
result
}
# Extract RMSE for models
pull_rmse <- function(result_table) {
rmse_result <- rmse(result_table, pred, actual) %>%
pull(.estimate)
rmse_result_real <- rmse(result_table, pred_real, actual_real) %>%
pull(.estimate)
  result <- c(rmse = rmse_result, real_rmse = rmse_result_real)
  result
}
Baseline
The baseline is the average of the total.
base_train_pred <-
  tibble(
    actual = train_data$total,
    actual_real = train_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))

base_test_pred <-
  tibble(
    actual = test_data$total,
    actual_real = test_data$total^3
  ) %>%
  mutate(pred = mean(actual), pred_real = mean(actual_real))

base_train_rmse <- pull_rmse(base_train_pred)
print(base_train_rmse)
##      rmse real_rmse
##  2.032927 181.063306

base_test_rmse <- pull_rmse(base_test_pred)
print(base_test_rmse)
##     rmse real_rmse
##  2.02608 182.61370
Decision trees with tidymodels
I used parsnip for modelling, workflows for, well… the workflow, tune for parameter tuning, and yardstick for performance metrics. I was also curious about the timing, so I recorded the elapsed time as well.
# Cost complexity for decision tree parameter
tree_cp <- seq(0.01, 0.1, 0.01)
set.seed(25)
tree_tidy_time1 <- Sys.time()
# Specify model
tree_engine <-
decision_tree(mode = "regression", cost_complexity = tune()) %>%
set_engine("rpart")
# Set workflow (Preprocess & model)
tree_workflow <-
workflow() %>%
add_recipe(prep_recipe) %>%
add_model(tree_engine)
# Tune parameters with cross-validation
tree_tune <- tune_grid(
tree_workflow,
resamples = train_cv,
grid = data.frame(cost_complexity = tree_cp),
metrics = metric_set(rmse)
)
# Fit again with the best parameter
tree_best <-
finalize_workflow(tree_workflow, select_best(tree_tune)) %>%
fit(train_data)
tree_tidy_time2 <- Sys.time()
print(tree_tidy_time2 - tree_tidy_time1)
## Time difference of 1.376683 mins
It takes around 1 minute and 20 seconds to cross-validate the ten candidate parameter values. Once it's done, I can predict the target variable and examine model performance with RMSE. Here I used the custom functions predict_table and pull_rmse to complete the task.
tree_tidy_train_pred <- predict_table(tree_best, train_data, TRUE)
tree_tidy_train_rmse <- pull_rmse(tree_tidy_train_pred)
print(tree_tidy_train_rmse)
## rmse real_rmse
##  1.078724 116.106006

tree_tidy_test_pred <- predict_table(tree_best, test_data, TRUE)
tree_tidy_test_rmse <- pull_rmse(tree_tidy_test_pred)
print(tree_tidy_test_rmse)
## rmse real_rmse
## 1.074347 118.205989
Decision trees with caret
set.seed(25)
tree_caret_time1 <- Sys.time()
tree_caret <- train(
total~.,
data = train_data_caret,
method = "rpart",
trControl = ctrl_caret,
metric = "RMSE",
tuneGrid = data.frame(cp = tree_cp)
)
tree_caret_time2 <- Sys.time()
print(tree_caret_time2 - tree_caret_time1)
## Time difference of 4.469931 secs
Wooooow! It only takes about 4.5 seconds. Moreover, the code is much shorter: the train function includes the model (method = "rpart"), the cross-validation (trControl = ctrl_caret), and the parameter tuning (tuneGrid = data.frame(cp = tree_cp)).
tree_caret_train_pred <- predict_table(tree_caret, train_data_caret, FALSE)
tree_caret_train_rmse <- pull_rmse(tree_caret_train_pred)
print(tree_caret_train_rmse)
## rmse real_rmse
##  1.078724 116.106006

tree_caret_test_pred <- predict_table(tree_caret, test_data_caret, FALSE)
tree_caret_test_rmse <- pull_rmse(tree_caret_test_pred)
print(tree_caret_test_rmse)
## rmse real_rmse
## 1.074347 118.205989
Compare models
rbind(
base_train_rmse, base_test_rmse,
tree_tidy_train_rmse, tree_tidy_test_rmse,
tree_caret_train_rmse, tree_caret_test_rmse
)

##                      rmse real_rmse
## base_train_rmse 2.032927 181.0633
## base_test_rmse 2.026080 182.6137
## tree_tidy_train_rmse 1.078724 116.1060
## tree_tidy_test_rmse 1.074347 118.2060
## tree_caret_train_rmse 1.078724 116.1060
## tree_caret_test_rmse 1.074347 118.2060
As you can see, the decision tree results are the same regardless of the library, since I split the data and set up the cross-validation in the same way, and both tidymodels and caret use rpart as the underlying engine. So it seems strange that tidymodels takes over a minute while caret needs only 4 to 5 seconds to run the decision tree. Part of the gap may simply be that the caret run received the already-baked data (train_data_caret), whereas the tidymodels workflow re-applies the recipe within each resample, but the difference is still striking.
Conclusion
I have been using tidymodels for a few weeks now, and I really enjoy the integration with the tidyverse. But I find it confusing to have so many steps and objects; for example, I kept trying to get RMSE directly from a workflow object (a sketch of what I believe is the intended route is below). I probably need a bit more time to get familiar with the new ecosystem.
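For reference, a hedged sketch of that route, assuming the objects from the decision-tree section above: last_fit() refits the finalised workflow on the training data and evaluates it on the test split, and collect_metrics() pulls out the numbers.

# Refit the best decision-tree workflow and score it on the held-out test set
tree_last_fit <- last_fit(
  finalize_workflow(tree_workflow, select_best(tree_tune)),
  split = split,
  metrics = metric_set(rmse)
)
collect_metrics(tree_last_fit)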
The complete code is available on GitHub, and an HTML version of the Markdown is published. As always, I hope this post is helpful. Have a wonderful day!
Reference
Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PloS one, 14(11), e0224365.