Using Bayesian optimisation to tune an XGBoost model in R

ML series (Post #1)

Riddhiman

My first post in 2022! A very happy new year to anyone reading this. šŸ˜„

I was looking for a simple and effective way to tune xgboost models in R and came across the ParBayesianOptimization package. Here’s a quick tutorial on how to use it to tune an xgboost model.

# Pacman is a package management tool 
install.packages("pacman")
library(pacman)

# p_load automatically installs packages if needed
p_load(xgboost, ParBayesianOptimization, mlbench, dplyr, skimr, recipes, rsample)

Data prep

# Load up some data
data("BostonHousing2")
# Data summary
skim(BostonHousing2)

Table 1: Data summary

Name BostonHousing2
Number of rows 506
Number of columns 19
_______________________
Column type frequency:
factor 2
numeric 17
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
town 0 1 FALSE 92 Cam: 30, Bos: 23, Lyn: 22, Bos: 19
chas 0 1 FALSE 2 0: 471, 1: 35

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tract 0 1 2700.36 1380.04 1.00 1303.25 3393.50 3739.75 5082.00 ā–…ā–‚ā–‚ā–‡ā–‚
lon 0 1 -71.06 0.08 -71.29 -71.09 -71.05 -71.02 -70.81 ā–ā–‚ā–‡ā–‚ā–
lat 0 1 42.22 0.06 42.03 42.18 42.22 42.25 42.38 ā–ā–ƒā–‡ā–ƒā–
medv 0 1 22.53 9.20 5.00 17.02 21.20 25.00 50.00 ā–‚ā–‡ā–…ā–ā–
cmedv 0 1 22.53 9.18 5.00 17.02 21.20 25.00 50.00 ā–‚ā–‡ā–…ā–ā–
crim 0 1 3.61 8.60 0.01 0.08 0.26 3.68 88.98 ā–‡ā–ā–ā–ā–
zn 0 1 11.36 23.32 0.00 0.00 0.00 12.50 100.00 ā–‡ā–ā–ā–ā–
indus 0 1 11.14 6.86 0.46 5.19 9.69 18.10 27.74 ā–‡ā–†ā–ā–‡ā–
nox 0 1 0.55 0.12 0.38 0.45 0.54 0.62 0.87 ā–‡ā–‡ā–†ā–…ā–
rm 0 1 6.28 0.70 3.56 5.89 6.21 6.62 8.78 ā–ā–‚ā–‡ā–‚ā–
age 0 1 68.57 28.15 2.90 45.02 77.50 94.07 100.00 ā–‚ā–‚ā–‚ā–ƒā–‡
dis 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.13 ā–‡ā–…ā–‚ā–ā–
rad 0 1 9.55 8.71 1.00 4.00 5.00 24.00 24.00 ā–‡ā–‚ā–ā–ā–ƒ
tax 0 1 408.24 168.54 187.00 279.00 330.00 666.00 711.00 ā–‡ā–‡ā–ƒā–ā–‡
ptratio 0 1 18.46 2.16 12.60 17.40 19.05 20.20 22.00 ā–ā–ƒā–…ā–…ā–‡
b 0 1 356.67 91.29 0.32 375.38 391.44 396.22 396.90 ā–ā–ā–ā–ā–‡
lstat 0 1 12.65 7.14 1.73 6.95 11.36 16.96 37.97 ā–‡ā–‡ā–…ā–‚ā–

Looks like there are two factor variables. We’ll need to convert them to numeric variables before we proceed. I’ll use the recipes package to one-hot encode them.

# Predicting median house prices
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>% 
  
  # Create dummy variables for all factor variables 
  step_dummy(all_nominal_predictors())

# Train the recipe on the data set
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)
# All levels have been one hot encoded and separate columns have been appended to the model matrix
colnames(model_df)
##  [1] "tract"                  "lon"                    "lat"                   
##  [4] "medv"                   "crim"                   "zn"                    
##  [7] "indus"                  "nox"                    "rm"                    
## [10] "age"                    "dis"                    "rad"                   
## [13] "tax"                    "ptratio"                "b"                     
## [16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
## [19] "town_Cambridge"         "town_Lynn"              "town_Newton"           
## [22] "town_Other"             "chas_X1"
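
As a quick optional sanity check, we can confirm that no factor columns are left after baking the recipe:

# Optional sanity check: should return FALSE if every factor was dummy-encoded
any(sapply(model_df, is.factor))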

Next, we can use the rsample package to create train/test splits.

splits <- rsample::initial_split(model_df, prop = 0.7)

# Training set
train_df <- rsample::training(splits)

# Test set
test_df <- rsample::testing(splits)
dim(train_df)
## [1] 354  23
dim(test_df)
## [1] 152  23
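
One optional tweak: initial_split() samples rows at random. If you want the distribution of the target to look similar in both sets, rsample can also stratify the split. I’m sticking with the plain split above for the rest of this post, but a stratified version would look like this:

# Alternative: stratified split on the target variable
splits_strat <- rsample::initial_split(model_df, prop = 0.7, strata = cmedv)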

Finding optimal parameters

Now we can start to run some optimisations using the ParBayesianOptimization package.

# The xgboost interface accepts matrices 
X <- train_df %>%
  # Drop the target (cmedv) and its uncorrected twin (medv)
  select(-c(medv, cmedv)) %>%
  as.matrix()

# Get the target variable
y <- train_df %>%
  pull(cmedv)
# Cross validation folds
folds <- list(fold1 = as.integer(seq(1, nrow(X), by = 5)),
              fold2 = as.integer(seq(2, nrow(X), by = 5)))
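
I’m only defining two fixed folds here (built from every fifth row) to keep run times low. If you’d rather pass conventional k-fold indices to xgb.cv(), a quick sketch with k = 5 would be:

# Alternative: k randomly assigned CV folds (a list of test-row indices per fold)
set.seed(42)
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(X)))
folds_k <- split(seq_len(nrow(X)), fold_id)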

We’ll need an objective function that can be fed to the optimiser. We’ll use the cross-validated evaluation metric from xgb.cv() as the value to be optimised.

# Function must take the hyper-parameters as inputs
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
  
  param <- list(
    
    # Hyper-parameters 
    eta = eta,
    max_depth = max_depth,
    min_child_weight = min_child_weight,
    subsample = subsample,
    lambda = lambda,
    alpha = alpha,
    
    # Tree model 
    booster = "gbtree",
    
    # Regression problem 
    objective = "reg:squarederror",
    
    # Use the Mean Absolute Percentage Error
    eval_metric = "mape")
  
  xgbcv <- xgb.cv(params = param,
                  data = X,
                  label = y,
                  nrounds = 50,
                  folds = folds,
                  prediction = TRUE,
                  early_stopping_rounds = 5,
                  verbose = 0,
                  maximize = F)
  
  lst <- list(
    
    # First argument must be named as "Score"
    # Function finds maxima so inverting the output
    Score = -min(xgbcv$evaluation_log$test_mape_mean),
    
    # Get number of trees for the best performing model
    nrounds = xgbcv$best_iteration
  )
  
  return(lst)
}
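
Before handing this over to the optimiser, it’s worth calling the function once by hand to check that it runs end to end. The values below are arbitrary picks, just for a smoke test:

# Smoke test with arbitrary hyper-parameter values
obj_func(eta = 0.1, max_depth = 5, min_child_weight = 10,
         subsample = 0.8, lambda = 2, alpha = 2)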

Once we have the objective function, we’ll need to define some bounds for the optimiser to search within.

bounds <- list(eta = c(0.001, 0.2),
               max_depth = c(1L, 10L),
               min_child_weight = c(1, 50),
               subsample = c(0.1, 1),
               lambda = c(1, 10),
               alpha = c(1, 10))

We can now run the optimiser to find a set of optimal hyper-parameters.

set.seed(1234)
bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds, initPoints = length(bounds) + 2, iters.n = 3)
# Show relevant columns from the summary object 
bayes_out$scoreSummary[1:5, c(3:8, 13)]
##           eta max_depth min_child_weight subsample   lambda    alpha      Score
## 1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.0902970
## 2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1402720
## 3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1475580
## 4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1410245
## 5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3061535
# Get best parameters
data.frame(getBestPars(bayes_out))
##         eta max_depth min_child_weight subsample lambda alpha
## 1 0.1905414         8         1.541476 0.8729207      1     1
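
If the search doesn’t look like it has converged, ParBayesianOptimization can also continue from an existing result instead of starting over; if I’m reading the docs right, addIterations() does exactly that:

# Run a few more optimisation iterations on top of the existing results
# (assuming addIterations() reuses the bounds and scoring function stored in bayes_out)
bayes_out <- addIterations(bayes_out, iters.n = 3)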

Fitting the model

We can now fit a model and check how well these parameters work.

# Combine best params with base params
opt_params <- append(list(booster = "gbtree", 
                          objective = "reg:squarederror", 
                          eval_metric = "mae"), 
                     getBestPars(bayes_out))

# Run cross validation 
xgbcv <- xgb.cv(params = opt_params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = F)

# Get optimal number of rounds
nrounds = xgbcv$best_iteration

# Fit a xgb model
mdl <- xgboost(data = X, label = y, 
               params = opt_params, 
               maximize = F, 
               early_stopping_rounds = 5, 
               nrounds = nrounds, 
               verbose = 0)
# Evaluate performance 
actuals <- test_df$cmedv
predicted <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)
# Compute MAPE
mean(abs(actuals - predicted)/actuals)
## [1] 0.009401109

Comparing with grid search

For comparison, let’s tune the same model using a traditional grid search over a similar parameter space.

grd <- expand.grid(
  eta = seq(0.001, 0.2, length.out = 5),
  max_depth = seq(2L, 10L, by = 1),
  min_child_weight = seq(1, 25, length.out = 3),
  subsample = c(0.25, 0.5, 0.75, 1),
  lambda = c(1, 5, 10),
  alpha = c(1, 5, 10))

dim(grd)
grd_out <- apply(grd, 1, function(par){
  
    par <- append(par, list(booster = "gbtree",
                            objective = "reg:squarederror",
                            eval_metric = "mae"))
    mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                   early_stopping_rounds = 5, maximize = F, verbose = 0)
    lst <- data.frame(par, score = mdl$best_score)

    return(lst)
  })

grd_out <- do.call(rbind, grd_out)
best_par <- grd_out %>%
  data.frame() %>%
  arrange(score) %>%
  .[1,]
# Fit final model
params <- as.list(best_par[-length(best_par)])
xgbcv <- xgb.cv(params = params,
                data = X,
                label = y,
                nrounds = 100,
                folds = folds,
                prediction = TRUE,
                early_stopping_rounds = 5,
                verbose = 0,
                maximize = F)

nrounds = xgbcv$best_iteration

mdl <- xgboost(data = X, 
               label = y, 
               params = params, 
               maximize = F, 
               early_stopping_rounds = 5, 
               nrounds = nrounds, 
               verbose = 0)
# Evaluate on test set
act <- test_df$cmedv
pred <- test_df %>%
  select_at(mdl$feature_names) %>%
  as.matrix %>%
  predict(mdl, newdata = .)
mean(abs(act - pred)/act)
## [1] 0.01581713

While both methods arrive at similar final results, the Bayesian optimiser completed its search in under a minute, whereas the grid search took over seven minutes. In practice, I find Bayesian optimisation lets me cover a much larger parameter space in far less time than a traditional grid search.
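
If you want to verify the timings on your own machine, wrapping each search in system.time() is the easiest way (results will obviously vary with hardware):

# Rough timing comparison - the elapsed time is the number to look at
system.time({
  bayes_out <- bayesOpt(FUN = obj_func, bounds = bounds,
                        initPoints = length(bounds) + 2, iters.n = 3)
})

system.time({
  grd_out <- apply(grd, 1, function(par) {
    par <- append(par, list(booster = "gbtree",
                            objective = "reg:squarederror",
                            eval_metric = "mae"))
    mdl <- xgboost(data = X, label = y, params = par, nrounds = 50,
                   early_stopping_rounds = 5, maximize = F, verbose = 0)
    data.frame(par, score = mdl$best_score)
  })
})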

Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!