My first post in 2022! A very happy new year to anyone reading this.
I was looking for a simple and effective way to tune xgboost models in R and came across a package called ParBayesianOptimization. Here’s a quick tutorial on how to use it to tune an xgboost model.
# Pacman is a package management tool
install.packages("pacman")
library(pacman)
# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)
Data prep
# Load up some data
data("BostonHousing2")
# Data summary
skim(BostonHousing2)
Table 1: Data summary

Name | BostonHousing2 |
---|---|
Number of rows | 506 |
Number of columns | 19 |
Column type frequency: | |
factor | 2 |
numeric | 17 |
Group variables | None |

Variable type: factor

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
town | 0 | 1 | FALSE | 92 | Cam: 30, Bos: 23, Lyn: 22, Bos: 19 |
chas | 0 | 1 | FALSE | 2 | 0: 471, 1: 35 |

Variable type: numeric

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
tract | 0 | 1 | 2700.36 | 1380.04 | 1.00 | 1303.25 | 3393.50 | 3739.75 | 5082.00 |
lon | 0 | 1 | -71.06 | 0.08 | -71.29 | -71.09 | -71.05 | -71.02 | -70.81 |
lat | 0 | 1 | 42.22 | 0.06 | 42.03 | 42.18 | 42.22 | 42.25 | 42.38 |
medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 |
cmedv | 0 | 1 | 22.53 | 9.18 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 |
crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 |
zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 |
indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 |
nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 |
rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 |
age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 |
dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 |
rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 |
tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 |
ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 |
b | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 |
lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 |
Looks like there are two factor variables. We’ll need to convert them into numeric variables before we proceed. I’ll use the recipes package to one-hot encode them.
# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>%
  # Create dummy variables for all factor variables
  step_dummy(all_nominal_predictors())
# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)
# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)
# The factor levels have been converted into dummy columns appended to the model matrix
colnames(model_df)
## [1] "tract" "lon" "lat"
## [4] "medv" "crim" "zn"
## [7] "indus" "nox" "rm"
## [10] "age" "dis" "rad"
## [13] "tax" "ptratio" "b"
## [16] "lstat" "cmedv" "town_Boston.Savin.Hill"
## [19] "town_Cambridge" "town_Lynn" "town_Newton"
## [22] "town_Other" "chas_X1"
Next, we can use the rsample package to create train/test splits.
splits <- rsample::initial_split(model_df, prop = 0.7)
# Training set
train_df <- rsample::training(splits)
# Test set
test_df <- rsample::testing(splits)
dim(train_df)
## [1] 354 23
dim(test_df)
## [1] 152 23
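One thing to note: the split above is purely random, so an unlucky draw could give the training and test sets quite different price distributions. If that is a concern, rsample can stratify the split on the outcome. A small sketch (not run here):
# Stratified split: the numeric target is binned (quartiles by default)
# and sampling happens within each bin
splits <- rsample::initial_split(model_df, prop = 0.7, strata = cmedv)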
Finding optimal parameters
Now we can start to run some optimisations using the ParBayesianOptimization package.
# The xgboost interface accepts matrices
X <- train_df %>%
  # Remove the target (cmedv) and the uncorrected medv column
  select(!c(medv, cmedv)) %>%
  as.matrix()
# Get the target variable
y <- train_df %>%
  pull(cmedv)
# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
              fold2 = as.integer(seq(2, nrow(X), by = 5)))
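The two folds above are built by hand just to keep cross validation quick. If you’d rather not manage fold indices yourself, xgb.cv can generate folds for you via its nfold argument; a sketch of the idea (xgbcv_check is just a throwaway name):
# Let xgb.cv create 5 random folds instead of supplying our own indices
xgbcv_check <- xgb.cv(params = list(objective = "reg:squarederror", eval_metric = "mape"),
                      data = X, label = y, nrounds = 50, nfold = 5,
                      early_stopping_rounds = 5, verbose = 0)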
We’ll need an objective function that can be fed to the optimiser. We’ll use the best (lowest) cross-validated value of the evaluation metric from xgb.cv() as the quantity to be optimised.
# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
  param <- list(
    # Hyper-parameters being tuned
    eta = eta,
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    lambda = lambda,
    alpha = alpha,
    # Tree model
    booster = "gbtree",
    # Regression problem
    objective = "reg:squarederror",
    # Use the Mean Absolute Percentage Error
    eval_metric = "mape")
  xgbcv <- xgb.cv(params = param,
                  data = X,
                  label = y,
                  nrounds = 50,
                  folds = folds,
                  prediction = TRUE,
                  early_stopping_rounds = 5,
                  verbose = 0,
                  maximize = FALSE)
  lst <- list(
    # First element must be named "Score"
    # bayesOpt() looks for a maximum, so we negate the error metric
    Score = -min(xgbcv$evaluation_log$test_mape_mean),
    # Number of boosting rounds for the best performing model
    nrounds = xgbcv$best_iteration
  )
  return(lst)
}
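Before handing the function to the optimiser, it’s worth calling it once by hand to check that it runs end to end and returns what bayesOpt() expects. The parameter values below are arbitrary mid-range picks, not recommendations:
# Should return a list with a (negative) Score and the best nrounds
obj_func(eta = 0.1, max_depth = 5, min_child_weight = 10,
         subsample = 0.8, lambda = 5, alpha = 5)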
Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.
bounds <- list(eta = c(0.001, 0.2),
               max_depth = c(1L, 10L),
               min_child_weight = c(1, 50),
               subsample = c(0.1, 1),
               lambda = c(1, 10),
               alpha = c(1, 10))
We can now run the optimiser to find a set of optimal hyper-parameters.
set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds, initPoints = length(bounds) + 2, iters.n = 3)
# Show relevant columns from the summary object
bayes_out$scoreSummary[1:5, c(3:8, 13)]
## eta max_depth min_child_weight subsample lambda alpha Score
## 1: 0.13392137 8 4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811 2 25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775 2 14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707 4 12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388 4 36.586761 0.8107181 6.137100 6.039125 -0.3061535
# Get best parameters
data.frame(getBestPars(bayes_out))
## eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414 8 1.541476 0.8729207 1 1
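With only eight initial points and three optimisation steps this is a fairly shallow search. If the score still seems to be improving, the search can be resumed rather than restarted: ParBayesianOptimization provides addIterations() for this. A sketch, assuming the bayes_out object from above is still in the session:
# Run a few more optimisation steps on top of the existing results
bayes_out <- addIterations(bayes_out, iters.n = 5)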
Fitting the model
We can now fit a model and check how well these parameters work. Note that the cross validation below uses MAE as the early-stopping metric, while the objective function above tuned on MAPE; the final test-set comparison further down is still done on MAPE.
# Combine best params with base params
opt_params <- append(list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"),
                     getBestPars(bayes_out))
# Run cross validation
xgbcv <- xgb.cv(params = opt_params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)
# Get optimal number of rounds
nrounds = xgbcv$best_iteration
# Fit a xgb model
mdl <- xgboost(data = X, label = y,
               params = opt_params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)
# Evaluate performance
actuals <- test_df$cmedv
predicted <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)
# Compute MAPE
mean(abs(actuals - predicted)/actuals)
## [1] 0.009401109
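To see what is driving these predictions, xgboost can report per-feature importance for the fitted booster. A quick sketch using its built-in helpers:
# Gain-based feature importance
imp <- xgb.importance(model = mdl)
head(imp)
# Optional: bar chart of the top 10 features
xgb.plot.importance(imp, top_n = 10)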
Compare with grid search
grd <- expand.grid(
  eta = seq(0.001, 0.2, length.out = 5),
  max_depth = seq(2L, 10L, by = 1),
  min_child_weight = seq(1, 25, length.out = 3),
  subsample = c(0.25, 0.5, 0.75, 1),
  lambda = c(1, 5, 10),
  alpha = c(1, 5, 10))
dim(grd)
grd_out <- apply(grd, 1, function(par) {
  # Add the fixed parameters to this grid row
  par <- append(par, list(booster = "gbtree",
                          objective = "reg:squarederror",
                          eval_metric = "mae"))
  # Fit a model with early stopping and record its best score
  mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                 early_stopping_rounds = 5, maximize = FALSE, verbose = 0)
  lst <- data.frame(par, score = mdl$best_score)
  return(lst)
})
grd_out <- do.call(rbind, grd_out)
best_par <- grd_out %>%
  data.frame() %>%
  arrange(score) %>%
  .[1, ]
# Fit final model
params <- as.list(best_par[-length(best_par)])
xgbcv <- xgb.cv(params = params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = FALSE)
nrounds = xgbcv$best_iteration
mdl <- xgboost(data = X,
               label = y,
               params = params,
               maximize = FALSE,
               early_stopping_rounds = 5,
               nrounds = nrounds,
               verbose = 0)
# Evaluate on test set
# Use the same target (cmedv) that the model was trained on
act <- test_df$cmedv
pred <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)
mean(abs(act - pred)/act)
## [1] 0.01581713
While both methods arrive at similar final results, the Bayesian optimiser completed its search in less than a minute, whereas the grid search took over seven minutes. In practice, this means Bayesian optimisation lets me search a much larger parameter space in the same amount of time as a traditional grid search.
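Timings like these will obviously vary from machine to machine; if you want to reproduce the comparison yourself, wrapping each search in system.time() is the simplest way to do it. A sketch, re-using the objects defined above:
# Time the Bayesian optimisation run
system.time({
  bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds,
                        initPoints = length(bounds) + 2, iters.n = 3)
})
# Time the grid search loop over all grid rows
system.time({
  grd_out <- apply(grd, 1, function(par) {
    par <- append(par, list(booster = "gbtree", objective = "reg:squarederror", eval_metric = "mae"))
    mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                   early_stopping_rounds = 5, maximize = FALSE, verbose = 0)
    data.frame(par, score = mdl$best_score)
  })
})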
Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!