Using Bayesian Optimization to Tune XGBoost Models in R

R
Analytics
Machine Learning
Published

September 18, 2024

Tuning machine learning models can be time-consuming and computationally expensive. This post shows how to use Bayesian optimization to efficiently find optimal XGBoost hyperparameters – saving time and improving model performance.

Required Packages

# Load required packages
library(xgboost)
library(ParBayesianOptimization)
library(mlbench)
library(dplyr)
library(recipes)
library(rsample)

Data Preparation

We’ll use the Boston Housing dataset – a classic regression problem with both numeric and categorical variables.

# Load the Boston Housing dataset
data("BostonHousing2")

# Quick look at the data structure
str(BostonHousing2)
'data.frame':   506 obs. of  19 variables:
 $ town   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
 $ tract  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
 $ lon    : num  -71 -71 -70.9 -70.9 -70.9 ...
 $ lat    : num  42.3 42.3 42.3 42.3 42.3 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ cmedv  : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ b      : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...

XGBoost requires numeric inputs, so we’ll use the recipes package to transform our categorical variables:

# Create a recipe for preprocessing
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
  # Collapse categories where population is < 3%
  step_other(town, chas, threshold = .03, other = "Other") %>% 
  # Create dummy variables for all factor variables 
  step_dummy(all_nominal_predictors())

# Train the recipe on the dataset
prep <- prep(rec, training = BostonHousing2)

# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)

# Check the column names after one-hot encoding
colnames(model_df)
 [1] "tract"                  "lon"                    "lat"                   
 [4] "medv"                   "crim"                   "zn"                    
 [7] "indus"                  "nox"                    "rm"                    
[10] "age"                    "dis"                    "rad"                   
[13] "tax"                    "ptratio"                "b"                     
[16] "lstat"                  "cmedv"                  "town_Boston.Savin.Hill"
[19] "town_Cambridge"         "town_Lynn"              "town_Newton"           
[22] "town_Other"             "chas_X1"               

Next, we’ll split our data into training and testing sets:

# Create a 70/30 train-test split
splits <- rsample::initial_split(model_df, prop = 0.7)
train_df <- rsample::training(splits)
test_df <- rsample::testing(splits)

# Prepare the training data for XGBoost
X <- train_df %>%
  select(-c(medv, cmedv)) %>%  # drop both target columns so they cannot leak into the features
  as.matrix()

# Get the target variable
y <- train_df %>% pull(cmedv)

# Create two manual cross-validation folds (every fifth row, offset by one) to keep tuning fast
folds <- list(
  fold1 = as.integer(seq(1, nrow(X), by = 5)),
  fold2 = as.integer(seq(2, nrow(X), by = 5))
)
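
The two folds above are deliberately small and systematic (every fifth row) to keep the tuning loop fast. If you prefer a conventional random k-fold split, a base-R sketch like the one below builds the list of held-out row indices that xgb.cv's folds argument expects (the five-fold count and the seed are illustrative choices):

# Optional: a random 5-fold split instead of the two manual folds above
set.seed(2024)
shuffled <- sample(nrow(X))
folds_5 <- split(shuffled, cut(seq_along(shuffled), breaks = 5, labels = FALSE))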

Setting Up Bayesian Optimization

Bayesian optimization requires two key components:

  1. An objective function that evaluates model performance
  2. The parameter bounds we want to explore

# The objective function takes a set of hyperparameters and returns a cross-validated score
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
  
  param <- list(
    # Learning parameters
    eta = eta,                       # Learning rate
    max_depth = max_depth,           # Tree depth
    min_child_weight = min_child_weight, # Min observations per node
    subsample = subsample,           # Data subsampling
    lambda = lambda,                 # L2 regularization
    alpha = alpha,                   # L1 regularization
    
    booster = "gbtree",             # Use tree model
    objective = "reg:squarederror",  # Regression task
    eval_metric = "mape"            # Mean Absolute Percentage Error
  )
  
  xgbcv <- xgb.cv(params = param,
                  data = X,
                  label = y,
                  nrounds = 50,
                  folds = folds,
                  prediction = TRUE,
                  early_stopping_rounds = 5,
                  verbose = 0,
                  maximize = FALSE)
  
  lst <- list(
    # The first element must be named "Score"
    # bayesOpt maximizes, so we negate the error metric
    Score = -min(xgbcv$evaluation_log$test_mape_mean),
    
    # Get number of trees for the best performing model
    nrounds = xgbcv$best_iteration
  )
  
  return(lst)
}

# Define the search space for each parameter
bounds <- list(
  eta = c(0.001, 0.2),             # Learning rate range
  max_depth = c(1L, 10L),           # Tree depth range
  min_child_weight = c(1, 50),      # Min observations range
  subsample = c(0.1, 1),            # Subsampling range
  lambda = c(1, 10),                # L2 regularization range
  alpha = c(1, 10)                  # L1 regularization range
)

Running Bayesian Optimization

Now we’ll run the optimization process to intelligently search for the best parameters:

set.seed(1234)
bayes_out <- bayesOpt(
  FUN = obj_func,                    # Our objective function
  bounds = bounds,                   # Parameter bounds
  initPoints = length(bounds) + 2,   # Initial random points
  iters.n = 10,                      # Number of iterations
  verbose = 0                        # Suppress output
)

# View the first few evaluated parameter sets and their scores
bayes_out$scoreSummary[1:5, c(3:8, 13)]
          eta max_depth min_child_weight subsample   lambda    alpha      Score
        <num>     <num>            <num>     <num>    <num>    <num>      <num>
1: 0.13392137         8         4.913332 0.2105925 4.721124 3.887629 -0.1292920
2: 0.19400811         2        25.454160 0.9594105 9.329695 3.173695 -0.1790158
3: 0.16079775         2        14.035652 0.5118349 1.229953 5.093530 -0.1662595
4: 0.08957707         4        12.534842 0.3844404 4.358837 1.788342 -0.1672395
5: 0.02876388         4        36.586761 0.8107181 6.137100 6.039125 -0.3320015

# Get the best parameters
best_params <- getBestPars(bayes_out)
data.frame(best_params)
        eta max_depth min_child_weight subsample lambda    alpha
1 0.1251447        10                1         1      1 5.905011
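
It is also worth checking how the score improved over the course of the search. The scoreSummary table above already contains everything needed; a minimal base-R sketch:

# Plot how the score (negative MAPE) evolved across evaluations
plot(bayes_out$scoreSummary$Score, type = "b",
     xlab = "Evaluation", ylab = "Score (negative MAPE)")
abline(h = max(bayes_out$scoreSummary$Score), lty = 2)  # best score found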

Training the Final Model

With the optimal hyperparameters identified, we can now train our final XGBoost model.

# Combine best params with base params
opt_params <- append(
  list(booster = "gbtree", 
       objective = "reg:squarederror", 
       eval_metric = "mae"), 
  best_params
)

# Run cross-validation to determine optimal number of rounds
xgbcv <- xgb.cv(
  params = opt_params,
  data = X,
  label = y,
  nrounds = 100,
  folds = folds,
  prediction = TRUE,
  early_stopping_rounds = 5,
  verbose = 0,
  maximize = FALSE
)

# Get optimal number of rounds
nrounds <- xgbcv$best_iteration

# Fit the final XGBoost model
mdl <- xgboost(
  data = X, 
  label = y, 
  params = opt_params, 
  maximize = FALSE, 
  early_stopping_rounds = 5, 
  nrounds = nrounds, 
  verbose = 0
)

# Make predictions on the test set
actuals <- test_df$cmedv
predicted <- test_df %>%
  select(all_of(mdl$feature_names)) %>%  # keep only the model's feature columns
  as.matrix() %>%
  predict(mdl, newdata = .)

# Evaluate performance using Mean Absolute Percentage Error (MAPE)
mape <- mean(abs(actuals - predicted)/actuals)
cat("MAPE on test set:", mape)
MAPE on test set: 0.006424492

Why Bayesian Optimization?

Bayesian optimization offers several key advantages over traditional grid search:

  1. Efficiency: Finds good parameters in far fewer model evaluations than an exhaustive grid (see the sketch after this list)
  2. Intelligence: Learns from previous evaluations to focus on promising regions of the search space
  3. Scalability: Remains practical as the number of hyperparameters grows
  4. Speed: Completes in a fraction of the time of a grid search while achieving comparable or better results
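
To put the efficiency point in perspective, the run above needed only 18 evaluations of the objective function (8 initial points plus 10 optimization steps). An exhaustive grid over the same six hyperparameters grows much faster; the grid values below are arbitrary and purely illustrative:

# Illustrative only: even a coarse grid over the same six hyperparameters explodes quickly
grid <- expand.grid(
  eta = c(0.001, 0.05, 0.1, 0.2),
  max_depth = c(2, 4, 6, 8, 10),
  min_child_weight = c(1, 10, 25, 50),
  subsample = c(0.25, 0.5, 0.75, 1),
  lambda = c(1, 5, 10),
  alpha = c(1, 5, 10)
)
nrow(grid)  # 2880 candidate models, each requiring a full cross-validation run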

This approach becomes increasingly valuable as model complexity grows. For production models, consider increasing the iterations (iters.n) to ensure thorough exploration of the parameter space.
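
A rough sketch of such a longer run is shown below. It assumes the doParallel package is installed; ParBayesianOptimization can evaluate iters.k candidate points per optimization step in parallel once a foreach backend is registered, and the workers need access to the data objects and packages used inside the objective function. The cluster size and iteration counts are illustrative.

# A sketch of a more thorough, parallel search (cluster size and iteration counts are illustrative)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)
clusterExport(cl, c("X", "y", "folds"))   # objects used inside obj_func
clusterEvalQ(cl, library(xgboost))        # packages the workers need

bayes_long <- bayesOpt(
  FUN = obj_func,
  bounds = bounds,
  initPoints = length(bounds) + 2,
  iters.n = 40,       # many more optimization steps than the quick demo above
  iters.k = 4,        # candidate points evaluated per step, in parallel
  parallel = TRUE,
  verbose = 0
)

stopCluster(cl)
registerDoSEQ()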

The ParBayesianOptimization package makes this powerful technique accessible to R users, allowing you to build better models with less computational overhead.