Optimizing XGBoost Hyperparameters Using Bayesian Optimization in R

Hyperparameter tuning is computationally intensive and time-consuming. In this post, we’ll use Bayesian optimization to search for good XGBoost hyperparameters, a systematic approach that typically requires far fewer evaluations than an exhaustive grid search.

Required Packages

# Install the packages if needed (one-time setup)
pak::pkg_install(c("xgboost", "rBayesianOptimization", "mlbench", "dplyr", "recipes", "rsample"))

library(xgboost)
library(rBayesianOptimization)
library(mlbench)
library(dplyr)
library(recipes)
library(rsample)

Data Acquisition and Initial Processing
This post uses the Boston Housing dataset — a regression problem with both numerical and categorical variables, and a great starting point for demonstrating Bayesian optimization.
# Load the Boston Housing dataset
data("BostonHousing2")
# Quick look at the data structure
str(BostonHousing2)

'data.frame': 506 obs. of 19 variables:
$ town : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
$ tract : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
$ lon : num -71 -71 -70.9 -70.9 -70.9 ...
$ lat : num 42.3 42.3 42.3 42.3 42.3 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
$ cmedv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : int 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ b : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
XGBoost accepts only numeric input features, so categorical variables must be encoded before the algorithm can use them. We can use the recipes package to handle this preprocessing:
# Create a recipe for preprocessing
rec <- recipe(cmedv ~ ., data = BostonHousing2) %>%
# Collapse categories where population is < 3%
step_other(town, chas, threshold = .03, other = "Other") %>%
# Create dummy variables for all factor variables
step_dummy(all_nominal_predictors())
# Train the recipe on the dataset
prep <- prep(rec, training = BostonHousing2)
# Create the final model matrix
model_df <- bake(prep, new_data = BostonHousing2)
# Check the column names after one-hot encoding
colnames(model_df)

 [1] "tract" "lon" "lat"
[4] "medv" "crim" "zn"
[7] "indus" "nox" "rm"
[10] "age" "dis" "rad"
[13] "tax" "ptratio" "b"
[16] "lstat" "cmedv" "town_Boston.Savin.Hill"
[19] "town_Cambridge" "town_Lynn" "town_Newton"
[22] "town_Other" "chas_X1"
Let’s also partition the data into training and testing sets:
# Create a 70/30 train-test split
splits <- initial_split(model_df, prop = 0.7)
train_df <- training(splits)
test_df <- testing(splits)
# Prepare the training data for XGBoost
X <- train_df %>%
  # Drop both response columns; select(!medv, !cmedv) would take the union
  # of "everything but medv" and "everything but cmedv", i.e. every column
  select(-c(medv, cmedv)) %>%
  as.matrix()
# Get the target variable
y <- train_df %>%
pull(cmedv)
# Build xgb.DMatrix (required by xgboost >= 2.x)
dtrain <- xgb.DMatrix(data = X, label = y)
# Create cross-validation folds
# (each entry holds the held-out test indices for one fold)
folds <- list(
  fold1 = as.integer(seq(1, nrow(X), by = 5)),
  fold2 = as.integer(seq(2, nrow(X), by = 5))
)

Bayesian Optimization Framework Implementation
Bayesian optimization needs two components:
- Objective Function: A function that evaluates model performance for a given hyperparameter configuration
- Parameter Space Definition: The bounds and constraints that define the search space for optimization
# Objective function takes hyperparameters as inputs
obj_func <- function(eta,
max_depth,
min_child_weight,
subsample,
lambda,
alpha) {
param <- list(
eta = eta, # Learning rate
max_depth = max_depth, # Tree depth
min_child_weight = min_child_weight, # Min observations per node
subsample = subsample, # Data subsampling
lambda = lambda, # L2 regularization
alpha = alpha, # L1 regularization
booster = "gbtree", # Use tree model
objective = "reg:squarederror", # Regression task
eval_metric = "mape" # Mean Absolute Percentage Error
)
xgbcv <- xgb.cv(params = param,
data = dtrain,
nrounds = 50,
folds = folds,
prediction = TRUE,
early_stopping_rounds = 5,
verbose = 0,
maximize = FALSE)
lst <- list(
# First element must be named "Score";
# BayesianOptimization maximizes, so we negate the error
Score = -min(xgbcv$evaluation_log$test_mape_mean),
# Pred is required by rBayesianOptimization (placeholder when not stacking)
Pred = 0
)
return(lst)
}
# Define the search space for each parameter
bounds <- list(
eta = c(0.001, 0.2), # Learning rate range
max_depth = c(1L, 10L), # Tree depth range
min_child_weight = c(1, 50), # Min observations range
subsample = c(0.1, 1), # Subsampling range
lambda = c(1, 10), # L2 regularization range
alpha = c(1, 10) # L1 regularization range
)

Executing the Optimization Process
The optimizer fits a Gaussian process surrogate to the evaluations made so far and uses an acquisition function (here, upper confidence bound) to choose the next parameter combination to try:
set.seed(1234)
bayes_out <- BayesianOptimization(
FUN = obj_func, # Our objective function
bounds = bounds, # Parameter bounds
init_points = length(bounds) + 2, # Initial random points
n_iter = 10, # Number of iterations
acq = "ucb", # Acquisition function
verbose = FALSE # Suppress output
)
Best Parameters Found:
Round = 10 eta = 0.1841661 max_depth = 10.0000 min_child_weight = 1.0000 subsample = 0.8610328 lambda = 1.829903 alpha = 5.728735 Value = -0.05119594
# View top results
head(bayes_out$History, 5) Round eta max_depth min_child_weight subsample lambda alpha
<int> <num> <num> <num> <num> <num> <num>
1: 1 0.02362698 7 15.02494 0.2969196 3.742050 5.980002
2: 2 0.12483758 6 14.07422 0.8295387 5.565762 6.817655
3: 3 0.12224567 7 10.14942 0.5731278 2.629866 3.806419
4: 4 0.12505251 6 12.37907 0.9231923 7.837036 6.596373
5: 5 0.17232216 4 16.51401 0.8482105 2.811232 3.967932
Value
<num>
1: -0.34482013
2: -0.11896350
3: -0.12416720
4: -0.09830514
5: -0.14300145
# Get the best parameters
best_params <- as.list(bayes_out$Best_Par)
best_params

$eta
[1] 0.1841661
$max_depth
[1] 10
$min_child_weight
[1] 1
$subsample
[1] 0.8610328
$lambda
[1] 1.829903
$alpha
[1] 5.728735
Final Model Development and Evaluation
With the best parameters found by the optimizer, let’s train the final XGBoost model and evaluate it on the held-out test set.
# Combine best params with base params
opt_params <- xgb.params(
eta = best_params$eta,
max_depth = best_params$max_depth,
min_child_weight = best_params$min_child_weight,
subsample = best_params$subsample,
lambda = best_params$lambda,
alpha = best_params$alpha,
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "mape"
)
# Fit the final XGBoost model
mdl <- xgb.train(
params = opt_params,
data = dtrain,
nrounds = 100,
verbose = 0
)
# Make predictions on the test set
actuals <- test_df$cmedv
test_features <- as.matrix(test_df[, getinfo(dtrain, "feature_name")])
dtest <- xgb.DMatrix(data = test_features)
predicted <- predict(mdl, newdata = dtest)
# Evaluate performance using Mean Absolute Percentage Error (MAPE)
mape <- mean(abs(actuals - predicted)/actuals)
cat("MAPE on test set:", mape)

MAPE on test set: 0.01258446
Advantages of Bayesian Optimization
Compared to grid or random search, Bayesian optimization has a few advantages:
- Fewer evaluations: It uses a surrogate model to prioritise promising regions of the parameter space, so it typically needs fewer iterations to find good parameters
- Adaptive search: Each new evaluation informs the next, rather than being chosen blindly
- Scales reasonably well: Adding more hyperparameters doesn’t require an exponential increase in evaluations the way grid search does
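To put "fewer evaluations" in perspective, here is a back-of-the-envelope comparison for the six parameters tuned above. A grid's size grows multiplicatively with each parameter, while the Bayesian run used only init_points + n_iter evaluations:

```r
# Evaluations needed by even a coarse grid: 5 values per parameter, 6 parameters
grid_evals <- 5^6
grid_evals
# [1] 15625

# Evaluations used by the Bayesian run above:
# init_points = length(bounds) + 2 = 8, plus n_iter = 10
bayes_evals <- (6 + 2) + 10
bayes_evals
# [1] 18
```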
Practical Considerations
Bayesian optimization pays off most when models are expensive to train and the hyperparameter space is large. For production runs, increasing the iteration count (n_iter) gives the optimizer a larger budget to explore the parameter landscape.
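As a sketch, a longer production search might simply raise init_points and n_iter in the call shown earlier; the budget values below are illustrative, not tuned:

```r
# A larger search budget for production use (illustrative values)
bayes_out_long <- BayesianOptimization(
  FUN = obj_func,
  bounds = bounds,
  init_points = 20,   # more random points to seed the surrogate model
  n_iter = 50,        # more guided iterations
  acq = "ucb",
  verbose = FALSE
)
```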
Key Takeaways
- Bayesian optimization builds a surrogate model of the objective function, using past results to guide future evaluations
- Data preprocessing matters for XGBoost — categorical variables need encoding before the algorithm can use them
- Cross-validation within the objective function gives a more reliable performance estimate for each parameter set
- Increasing n_iter gives the optimizer more chances to explore, which can help in high-dimensional parameter spaces
The same approach applies to any model that needs hyperparameter tuning — not just XGBoost.
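As an illustration of that generality, here is a minimal objective function for an elastic-net linear model (assuming the glmnet package is installed; glmnet_obj and glmnet_bounds are illustrative names). It follows the same pattern as obj_func: take hyperparameters as arguments, cross-validate, and return a named Score. It reuses X, y, and folds from above:

```r
library(glmnet)

# Same pattern, different model: tune glmnet's alpha (elastic-net mixing)
# and lambda (penalty strength)
glmnet_obj <- function(alpha, lambda) {
  errors <- sapply(folds, function(test_idx) {
    fit <- glmnet(X[-test_idx, ], y[-test_idx], alpha = alpha, lambda = lambda)
    pred <- predict(fit, newx = X[test_idx, ], s = lambda)
    mean(abs(y[test_idx] - pred) / y[test_idx])  # MAPE on the held-out fold
  })
  # Negate the mean CV error, since BayesianOptimization maximizes
  list(Score = -mean(errors), Pred = 0)
}

glmnet_bounds <- list(alpha = c(0, 1), lambda = c(0.001, 1))
```

The only pieces that change between models are the fitting call inside the objective function and the bounds list; the BayesianOptimization call itself stays the same.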