if (!require("rpart")) install.packages("rpart")
if (!require("rpart.plot")) install.packages("rpart.plot")
Basic regression trees partition a data set into smaller groups and then fit a simple model (constant) for each subgroup. Unfortunately, a single tree model tends to be highly unstable and a poor predictor. However, by bootstrap aggregating (bagging) regression trees, this technique can become quite powerful and effective. Moreover, this provides the fundamental basis of more complex tree-based models such as random forests and gradient boosting machines. This tutorial will get you started with regression trees and bagging.
There are many methodologies for constructing regression trees but one of the oldest is known as the classification and regression tree (CART) approach developed by Breiman et al. (1984). This tutorial focuses on the regression part of CART. Basic regression trees partition a data set into smaller subgroups and then fit a simple constant for each observation in the subgroup. The partitioning is achieved by successive binary partitions (aka recursive partitioning) based on the different predictors. The constant to predict is based on the average response values for all observations that fall in that subgroup.
For example, suppose we want to predict the quantity of ozone in the air from wind speed. Here is how the data look.
data("airquality")
summary(airquality[,c("Ozone","Wind")])
##      Ozone            Wind
##  Min.   :  1.00   Min.   : 1.700
##  1st Qu.: 18.00   1st Qu.: 7.400
##  Median : 31.50   Median : 9.700
##  Mean   : 42.13   Mean   : 9.958
##  3rd Qu.: 63.25   3rd Qu.:11.500
##  Max.   :168.00   Max.   :20.700
##  NA's   :37
with(airquality,plot(Wind,Ozone))
The relationship is clearly not linear, so a regression tree may be a good fit.
Load rpart and rpart.plot:
library(rpart)
library(rpart.plot)
First we must eliminate the observations with missing values.
bagging_data <- na.omit(airquality[,c("Ozone","Wind")])
Then we fit a regression tree on all the data.
model <- rpart(Ozone ~ Wind, data=bagging_data)
rpart.plot(model, extra = 0)
Every observation travels down this tree: at each node the split condition is evaluated, and the observation proceeds to the left if the answer is “yes” or to the right if the answer is “no”.
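For instance, we can query the fitted tree directly for a new observation; the wind speed of 5 mph below is simply an illustrative value of our own choosing:
# predicted ozone for a hypothetical wind speed of 5 mph
predict(model, newdata = data.frame(Wind = 5))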
We then overlay the fitted step function on the scatter plot:
with(bagging_data,plot(Wind,Ozone))
oidx <- order(bagging_data$Wind)
lines(bagging_data$Wind[oidx], predict(model)[oidx])
In addition to the cost complexity \(\alpha\) parameter, it is also common to tune:
minsplit: the minimum number of data points required in a node for a split to be attempted before the node is forced to become a terminal node. The default is 20. Making it smaller allows terminal nodes that contain only a handful of observations to supply the predicted value.
maxdepth: the maximum number of internal nodes between the root node and the terminal nodes. The default is 30, which is quite liberal and allows fairly large trees to be built.
rpart uses a special control argument where we provide a list of hyperparameter values. For example, if we wanted to assess a model with minsplit = 6, we could execute the following:
model <- rpart(Ozone ~ Wind, data=bagging_data, control = list(minsplit = 6))
with(bagging_data,plot(Wind,Ozone))
lines(bagging_data$Wind[oidx], predict(model)[oidx])
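Several hyperparameters (along with the complexity parameter cp, which governs cost complexity pruning) can also be combined in the same control list. The values below are purely illustrative, not tuned:
model2 <- rpart(Ozone ~ Wind, data = bagging_data,
                control = rpart.control(minsplit = 6, maxdepth = 3, cp = 0.01))
printcp(model2)           # cost complexity table: one row per candidate cp value
rpart.plot(model2, extra = 0)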
One of the drawbacks of decision trees is that single tree models suffer from high variance. Although pruning the tree helps reduce this variance, there are alternative methods that actually exploit the variability of single trees in a way that can significantly improve performance over and above that of single trees. Bootstrap aggregating (bagging) is one such approach (originally proposed by Breiman, 1996).
Bagging combines and averages multiple models. Averaging across multiple trees reduces the variability of any one tree and reduces overfitting, which improves predictive performance. Bagging follows three simple steps: (1) create m bootstrap samples from the training data, (2) fit a single (unpruned) regression tree to each bootstrap sample, and (3) average the predictions of the individual trees to obtain the bagged prediction.
This process can actually be applied to any regression or classification model; however, it provides the greatest improvement for models that have high variance. More stable parametric models such as linear regression and multivariate adaptive regression splines, for example, tend to experience less improvement in predictive performance.
One benefit of bagging is that, on average, a bootstrap sample will contain about 63% (roughly two thirds) of the training data. This leaves about 37% (roughly one third) of the data out of the bootstrap sample. We call this the out-of-bag (OOB) sample. We can use the OOB observations to estimate the model’s accuracy, creating a natural cross-validation process.
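This 63% figure is easy to check empirically. The short simulation below is our own illustration: it repeatedly draws bootstrap samples of size 1000 and measures the fraction of distinct observations each one contains:
set.seed(123)
n_obs <- 1000
in_bag <- replicate(500, length(unique(sample.int(n_obs, n_obs, replace = TRUE))) / n_obs)
mean(in_bag)   # about 0.632, so roughly 37% of observations are out-of-bag on average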
Fitting a bagged tree model is quite simple. Instead of using rpart we use ipred::bagging, setting coob = TRUE so that the OOB sample is used to estimate the test error.
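A minimal sketch of that approach might look like the following (it assumes the ipred package is installed; nbagg = 100 is an arbitrary choice of ours):
if (!require("ipred")) install.packages("ipred")
library(ipred)
set.seed(123)
ipred_bag <- bagging(Ozone ~ Wind, data = bagging_data, nbagg = 100, coob = TRUE)
ipred_bag   # printing the object reports the out-of-bag root mean squared error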
Bagging with ipred is quite simple; however, there are some additional benefits to bagging with caret.
In this document, we will instead build the bagging procedure manually to show how it works.
Building bagged trees is straightforward. Say you want 100 models to average; for each of the 100 iterations you will (1) draw a bootstrap sample from the training data, the same size as the training set and drawn with replacement, and (2) fit a regression tree to that bootstrap sample.
Once you have trained all your models, to get a prediction from your bagged model on new data you will (1) obtain a prediction for the new data from each individual tree and (2) average those predictions into a single value.
First, we split the data into a training set and a test set:
set.seed(456)
n <- nrow(bagging_data)
train_idx <- sample.int(n,size=round(n*0.8),replace = F)
data_train <- bagging_data[train_idx,]
data_test <- bagging_data[-train_idx,]
Model without bagging:
no_bag_model <- rpart(Ozone~Wind,data_train,control=rpart.control(minsplit=6))
Training of the bagged model:
n_model <- 100
bagged_models <- list()
for (i in 1:n_model)
{
  # bootstrap sample: draw the rows of the training set with replacement
  new_sample <- sample(nrow(data_train), size = nrow(data_train), replace = TRUE)
  bagged_models <- c(bagged_models, list(rpart(Ozone ~ Wind, data_train[new_sample, ],
                                               control = rpart.control(minsplit = 6))))
}
Getting estimate from the bagged model:
bagged_result <- NULL
i <- 0
for (from_bag_model in bagged_models)
{
  # running mean: after processing i trees, bagged_result holds the average
  # of their predictions over the full data set
  if (is.null(bagged_result))
    bagged_result <- predict(from_bag_model, bagging_data)
  else
    bagged_result <- (i * bagged_result + predict(from_bag_model, bagging_data)) / (i + 1)
  i <- i + 1
}
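The loop above averages the predictions over the full data set so that they can be plotted. For arbitrary new data the same averaging can be wrapped in a small helper; the function name predict_bagged is our own, not part of any package:
# average the predictions of all bagged trees, row by row
predict_bagged <- function(models, newdata) {
  rowMeans(sapply(models, predict, newdata = newdata))
}
head(predict_bagged(bagged_models, data_test))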
Plot:
with(data_train,plot(Wind,Ozone,pch=19,col="blue"))
with(data_test,points(Wind,Ozone,pch=19,col="red"))
legend("topright",legend = c("train","test"),col = c("blue","red"),pch = 19)
oidx <- order(bagging_data$Wind)
lines(bagging_data$Wind[oidx], predict(no_bag_model, bagging_data)[oidx],
col="blue", lwd=1.5)
lines(bagging_data$Wind[oidx], bagged_result[oidx],
col="green", lwd=2)
Decision trees provide a very intuitive and flexible modeling approach. Unfortunately, single trees suffer from high variance; however, combining them with bagging minimizes this drawback. Moreover, bagged trees provide the fundamental basis for more complex and powerful algorithms such as random forests and gradient boosting machines.