Tutorial: Assessing Credit Score Prediction Reliability Using Bootstrap Resampling

R
Credit Risk Analytics
Bootstrapping
Published: November 15, 2024

Introduction

Credit scoring models tend to perform best in the central region of the score distribution and become less reliable at the extremes, where data is sparse. This tutorial walks through using bootstrap resampling to quantify prediction variability across score ranges, so that practitioners can identify where their models are most dependable.

Theoretical Foundation: Understanding Estimation Variance

Reduced sample sizes inherently result in increased variance in statistical estimates, particularly for extreme values within a distribution. While central tendency measures such as means demonstrate relative stability when estimated from limited data, tail percentiles (95th, 99th) exhibit substantially greater volatility. This phenomenon is of critical importance in credit scoring applications, where extremely high and low scores correspond to these unstable tail regions of the distribution.

# Number of samples to be drawn from a probability distribution
n_samples <- 1000

# Number of times sampling should be repeated
repeats <- 100

# Mean and std-dev for a normal distribution
mu <- 5
std_dev <- 2

# Sample
samples <- rnorm(n_samples * repeats, mean = mu, sd = std_dev)

# Fit into a matrix-like object with `n_samples` rows 
# and `repeats` columns (one column per repeated sample)
samples <- matrix(samples, nrow = n_samples, ncol = repeats)

# Compute the mean of each column (i.e. of each repeated sample)
sample_means <- apply(samples, 2, mean)

# Similarly, compute the 75%, 95% and 99% quantiles of each column
sample_75_quantile <- apply(samples, 2, quantile, p = 0.75)
sample_95_quantile <- apply(samples, 2, quantile, p = 0.95)
sample_99_quantile <- apply(samples, 2, quantile, p = 0.99)

# Compare coefficient of variation
sd(sample_means)/mean(sample_means)
[1] 0.0100664
sd(sample_75_quantile)/mean(sample_75_quantile)
[1] 0.01270263
sd(sample_95_quantile)/mean(sample_95_quantile)
[1] 0.01888391
# Plot the distributions
combined_vec <- c(sample_means, sample_75_quantile, sample_95_quantile, sample_99_quantile)

plot(density(sample_means), 
     col = "#6F69AC", 
     lwd = 3, 
     main = "Estimating the mean vs tail quantiles", 
     xlab = "", 
     xlim = c(min(combined_vec), max(combined_vec)))

lines(density(sample_75_quantile), col = "#95DAC1", lwd = 3)
lines(density(sample_95_quantile), col = "#FFEBA1", lwd = 3)
lines(density(sample_99_quantile), col = "#FD6F96", lwd = 3)
grid()

legend("topright", 
       fill = c("#6F69AC", "#95DAC1", "#FFEBA1", "#FD6F96"), 
       legend = c("Mean", "75% Quantile", "95% Quantile", "99% Quantile"), 
       cex = 0.7)

The visualization demonstrates the substantial increase in uncertainty when estimating extreme values compared to central tendencies. The distribution corresponding to the mean (purple) exhibits considerably narrower dispersion than that of the 99th percentile (pink). This statistical principle directly applies to credit scoring applications, where scores at the extremes of the distribution inherently possess greater uncertainty.

# Load required packages
library(dplyr)
library(magrittr)
library(rsample)
library(ggplot2)

Data Acquisition and Preprocessing

In this tutorial, we will utilize a sample from the Lending Club dataset. Loans classified as “Charged Off” will be designated as defaults. The observed class imbalance represents a typical characteristic of credit portfolios and contributes significantly to prediction challenges at distribution extremes.

# Load sample data (sample of the lending club data)
sample <- read.csv("http://bit.ly/42ypcnJ")

# Mark which loan status will be tagged as default
codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off")

# Apply above codes and create target
sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

# Replace missing values with a default value
sample[is.na(sample)] <- -1

# Get summary tally
table(sample$bad_flag)

   0    1 
8838 1162 
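
The tally implies a default rate of roughly 11.6% (1,162 of 10,000 loans), which can be confirmed directly:

# Proportion of good and bad loans
prop.table(table(sample$bad_flag))

     0      1 
0.8838 0.1162 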

Implementing Bootstrap Resampling Strategy

This section demonstrates the creation of 100 bootstrap samples to quantify the variation in model predictions across different score ranges. Bootstrap resampling is a statistical technique that generates multiple simulated datasets to measure prediction uncertainty without requiring additional data collection.

# Create 100 samples
boot_sample <- bootstraps(data = sample, times = 100)

head(boot_sample, 3)
# A tibble: 3 × 2
  splits               id          
  <list>               <chr>       
1 <split [10000/3700]> Bootstrap001
2 <split [10000/3630]> Bootstrap002
3 <split [10000/3639]> Bootstrap003
# Each row represents a separate bootstrapped sample with an analysis set and assessment set
boot_sample$splits[[1]]
<Analysis/Assess/Total>
<10000/3700/10000>
# Show the first 5 rows and 5 columns of the first sample
analysis(boot_sample$splits[[1]]) %>% .[1:5, 1:5]
     V1        id member_id loan_amnt funded_amnt
1 70066 143521620        -1     18000       18000
2 74725 143608122        -1     15000       15000
3 64235 122327224        -1     10000       10000
4 26743  70312012        -1     35000       35000
5 80520  68407752        -1     11000       11000

Each bootstrap sample consists of random draws with replacement from the original dataset, creating controlled variations that effectively reveal model sensitivity to different data compositions.
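
On average, a bootstrap sample contains roughly 63.2% of the distinct original rows; the rows left out form the assessment set (consistent with the split sizes printed above). A quick check on the first split:

# Out-of-bag share of the first bootstrap sample;
# expected to be close to exp(-1), i.e. about 36.8%
nrow(assessment(boot_sample$splits[[1]])) / nrow(sample)
[1] 0.37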

Developing the Predictive Model Framework

This tutorial employs logistic regression as the predictive modeling technique. Logistic regression represents the industry standard for credit risk modeling due to its interpretability and regulatory acceptance. The model specification incorporates typical credit variables including loan amount, income, and credit history metrics.

glm_model <- function(df){
  
  # Fit a simple model with a set specification
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = df)
  
  # Return predictions on the log-odds (link) scale
  return(predict(mdl))
}

# Test the function
# Retrieve a data frame
train <- analysis(boot_sample$splits[[1]])

# Predict
pred <- glm_model(train)

# Check output
range(pred)  # Output is on log odds scale
[1] -30.25449   1.57891

The function returns predictions in log-odds format, which will subsequently be transformed to a more intuitive credit score scale in later steps.
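
Since log odds are hard to interpret directly, a quick sanity check is to map the predictions back to default probabilities with the logistic function:

# Convert log-odds to probabilities of default (logistic transform)
range(plogis(pred))  # equivalent to 1 / (1 + exp(-pred))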

Iterative Model Training and Prediction Collection

# First apply the glm fitting function to each of the samples
# Note the use of lapply
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train)

  return(pred)
})

# Collate all predictions into a vector 
boot_preds <- do.call(c, output)
range(boot_preds)
[1] -130.406408    4.125475
# Get outlier thresholds
q_high <- quantile(boot_preds, 0.99)
q_low <- quantile(boot_preds, 0.01)

# Truncate the overall distribution to within the 1st and 99th percentiles,
# since extreme values create issues later on when scaling the output
boot_preds[boot_preds > q_high] <- q_high
boot_preds[boot_preds < q_low] <- q_low

range(boot_preds)
[1] -5.0603991 -0.2283903
# Convert to a data frame
boot_preds <- data.frame(pred = boot_preds, 
                         id = rep(1:length(boot_sample$splits), each = nrow(sample)))
head(boot_preds)
       pred id
1 -1.460522  1
2 -1.644074  1
3 -3.123883  1
4 -4.236837  1
5 -3.197778  1
6 -3.123942  1

In this step, we fit the logistic regression model on each bootstrap sample and collect the resulting predictions. We then truncate extreme values (beyond the 1st and 99th percentiles) to remove outliers, a procedure analogous to the capping techniques commonly employed in production credit models.

Transforming Predictions to Credit Score Scale

This section demonstrates the conversion of log-odds predictions to a recognizable credit score format using the industry-standard Points to Double the Odds (PDO) methodology. By employing parameters consistent with real-world credit systems (PDO = 30, Anchor = 700), we transform the model predictions into intuitive scores where higher numerical values indicate lower credit risk.

scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){
  beta <- PDO / log(2)
  alpha <- Anchor - beta * log(OddsAtAnchor)
  
  # Simple linear scaling of the log odds
  scr <- alpha - beta * vec  
  
  # Round off
  return(round(scr, 0))
}

boot_preds$scores <- scaling_func(boot_preds$pred, 30, 2, 700)
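
Under this scaling, a score equal to the anchor corresponds to good:bad odds of OddsAtAnchor, and each additional PDO points double those odds. A quick sanity check with the parameters used above:

# At anchor odds of 2:1, the log-odds of default equal -log(2),
# so the score should land exactly on the anchor of 700
scaling_func(-log(2), 30, 2, 700)
[1] 700

# Doubling the good:bad odds lowers the log-odds of default by log(2)
# and should add exactly PDO = 30 points
scaling_func(-log(4), 30, 2, 700) - scaling_func(-log(2), 30, 2, 700)
[1] 30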

# Chart the distribution of predictions across all the samples
ggplot(boot_preds, aes(x = scores, color = factor(id))) + 
  geom_density() + 
  theme_minimal() + 
  theme(legend.position = "none") + 
  scale_color_grey() + 
  labs(title = "Predictions from bootstrapped samples", 
       subtitle = "Density function", 
       x = "Predictions (Log odds)", 
       y = "Density")

Quantifying Prediction Uncertainty Across Score Ranges

This section measures how prediction reliability varies across score ranges by computing the standard deviation of the bootstrapped scores within each score bin, providing a direct quantification of uncertainty at each score level.

# Create bins using quantiles
breaks <- quantile(boot_preds$scores, probs = seq(0, 1, length.out = 20))
boot_preds$bins <- cut(boot_preds$scores, breaks = unique(breaks), include.lowest = T, right = T)

# Chart standard deviation of model predictions across each score bin
boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores)) %>%
  ggplot(aes(x = bins, y = std_dev)) +
  geom_col(color = "black", fill = "#90AACB") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  theme(legend.position = "none") + 
  labs(title = "Variability in model predictions across samples", 
       subtitle = "(measured using standard deviation)", 
       x = "Score Range", 
       y = "Standard Deviation")

As anticipated, the model’s predictions are most stable in the middle score bins, while the lowest and highest buckets exhibit significant variability.

The visualization reveals a characteristic “U-shaped” pattern of prediction variability, a well-documented phenomenon in credit risk modeling: uncertainty peaks in the extreme score ranges (very high and very low scores) and reaches its minimum in the middle range (roughly 600-800). This confirms the initial hypothesis and provides direct guidance for credit policy development: scores in the middle range are the most reliable, while decisions at the extremes should incorporate additional caution due to elevated uncertainty.

Practical Business Applications

These findings yield direct business applications:

  1. High Score Management: For extremely high scores, implement additional verification steps before automated approval
  2. Low Score Management: For very low scores, consider manual review procedures rather than automatic rejection

Advanced Methodology: Isolating Training Data Effects

Credit: Richard Warnung

For enhanced analytical control, this section presents an alternative approach where models are trained on bootstrap samples while being evaluated on a consistent validation set. This methodology effectively isolates the impact of training data variation on model predictions.

Vs <- function(boot_split){
  # Train model on the bootstrapped data
  train <- analysis(boot_split)
  
  # Fit model
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)
  
  # Apply to a common validation set
  # (assumes a hold-out data frame `validate_set` has been created beforehand)
  validate_preds <- predict(mdl, newdata = validate_set)
  
  # Return predictions
  return(validate_preds)
}

This methodology provides enhanced insight into how variations in training data composition affect model predictions, which proves valuable when evaluating model updates in production environments.

# Create overall training and testing datasets 
id <- sample(1:nrow(sample), size = nrow(sample)*0.8, replace = F)

train_data <- sample[id,]
test_data <- sample[-id,]

# Bootstrapped samples are now pulled only from the overall training dataset
boot_sample <- bootstraps(data = train_data, times = 80)

# Redefine the fitting function from before so that it predicts on a common test dataset
glm_model <- function(train, test){
  
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)
  
  # Return fitted values on the test dataset
  return(predict(mdl, newdata = test))
}

# Apply the glm fitting function to each of the samples
# But predict on the same test dataset
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train, test_data)

  return(pred)
})
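
From here, the predictions can be collated and analyzed exactly as before. A minimal sketch, reusing scaling_func and the quantile-binning approach from earlier (extreme log-odds could be truncated first, as before):

# Collate test-set predictions from all bootstrap models
test_preds <- data.frame(pred = do.call(c, output),
                         id = rep(seq_along(boot_sample$splits), each = nrow(test_data)))

# Scale to scores and summarise variability per score bin
test_preds$scores <- scaling_func(test_preds$pred, 30, 2, 700)
breaks <- quantile(test_preds$scores, probs = seq(0, 1, length.out = 20))
test_preds$bins <- cut(test_preds$scores, breaks = unique(breaks), include.lowest = T, right = T)

test_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores))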

Key Takeaways

  • Prediction reliability varies significantly across score ranges, with highest uncertainty at distribution extremes
  • Bootstrap methodology provides a practical framework for measuring model uncertainty without additional data requirements
  • Risk management strategies should be adjusted based on score reliability, with enhanced verification for extreme scores
  • The PDO transformation enables intuitive score interpretation while preserving the underlying risk relationships

These techniques can be applied beyond credit scoring to any prediction problem where understanding model uncertainty is critical for decision-making.