Assessing Credit Score Prediction Reliability Using Bootstrap Resampling

R
Credit Risk Analytics
Bootstrapping
Published

November 15, 2024

Credit scoring models tend to perform well in the middle of the score distribution — but reliability drops at the extremes, where data thins out. In this post, let’s use bootstrap resampling to measure how much predictions vary across score ranges and pinpoint where a model can and can’t be fully trusted.

Theoretical Foundation: Understanding Estimation Variance

Smaller sample sizes mean more variance in statistical estimates — especially for extreme values. Central measures like the mean stay relatively stable even with limited data, but tail percentiles (95th, 99th) can swing considerably. In credit scoring, this matters because the highest and lowest scores are precisely the ones living in those unstable tails.

# Sample size of each simulated dataset
n_samples <- 100

# Number of times sampling is repeated
repeats <- 1000

# Mean and std-dev of the normal distribution to sample from
mu <- 10
std_dev <- 1

# Sample
samples <- rnorm(n_samples * repeats, mean = mu, sd = std_dev)

# Arrange into a matrix with `n_samples` rows and `repeats` columns,
# so that each column holds one simulated dataset
samples <- matrix(samples, nrow = n_samples, ncol = repeats)

# Compute the mean of each column
sample_means <- apply(samples, 2, mean)

# Similarly, compute the 75%, 95% and 99% quantiles of each column
sample_75_quantile <- apply(samples, 2, quantile, probs = 0.75)
sample_95_quantile <- apply(samples, 2, quantile, probs = 0.95)
sample_99_quantile <- apply(samples, 2, quantile, probs = 0.99)

# Compare coefficients of variation
sd(sample_means)/mean(sample_means)
[1] 0.01020788
sd(sample_75_quantile)/mean(sample_75_quantile)
[1] 0.01266596
sd(sample_95_quantile)/mean(sample_95_quantile)
[1] 0.01939092
# Plot the distributions
combined_vec <- c(sample_means, sample_75_quantile, sample_95_quantile, sample_99_quantile)

plot(density(sample_means), 
     col = "#6F69AC", 
     lwd = 3, 
     main = "Estimating the mean vs tail quantiles", 
     xlab = "", 
     xlim = c(min(combined_vec), max(combined_vec)))

lines(density(sample_75_quantile), col = "#95DAC1", lwd = 3)
lines(density(sample_95_quantile), col = "#FFEBA1", lwd = 3)
lines(density(sample_99_quantile), col = "#FD6F96", lwd = 3)
grid()

legend("topright", 
       fill = c("#6F69AC", "#95DAC1", "#FFEBA1", "#FD6F96"), 
       legend = c("Mean", "75% Quantile", "95% Quantile", "99% Quantile"), 
       cex = 0.7)

The chart makes the point clearly — the mean distribution (purple) is much tighter than the 99th percentile (pink). The same principle applies in credit scoring: scores at the extremes carry more uncertainty, almost by definition.

# Load required packages
library(dplyr)
library(magrittr)
library(rsample)
library(ggplot2)

Data Acquisition and Preprocessing

In this post, let’s use a sample from the Lending Club dataset. Loans classified as “Charged Off” are treated as defaults. The class imbalance here is typical of real credit portfolios and is a key reason predictions become less reliable at the score extremes.

# Load sample data (sample of the lending club data)
sample <- read.csv("http://bit.ly/42ypcnJ")

# Mark which loan status will be tagged as default
codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off")

# Apply above codes and create target
sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

# Replace missing values with a default value
sample[is.na(sample)] <- -1

# Get summary tally
table(sample$bad_flag)

   0    1 
8838 1162 
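
The tally implies an overall default rate of about 11.6%, which can be confirmed directly since `bad_flag` is a 0/1 indicator:

```r
# The mean of a 0/1 flag is the proportion of 1s
mean(sample$bad_flag)  # 1162 / (8838 + 1162) = 0.1162
```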

Implementing Bootstrap Resampling Strategy

Let’s create 100 bootstrap samples to measure how much model predictions vary across score ranges. Bootstrap resampling generates multiple simulated datasets from the original data, so we can quantify prediction uncertainty without collecting any additional observations.

# Create 100 samples
boot_sample <- bootstraps(data = sample, times = 100)

head(boot_sample, 3)
# A tibble: 3 × 2
  splits               id          
  <list>               <chr>       
1 <split [10000/3697]> Bootstrap001
2 <split [10000/3696]> Bootstrap002
3 <split [10000/3639]> Bootstrap003
# Each row represents a separate bootstrapped sample with an analysis set and assessment set
boot_sample$splits[[1]]
<Analysis/Assess/Total>
<10000/3697/10000>
# Show the first 5 rows and 5 columns of the first sample
analysis(boot_sample$splits[[1]]) %>% .[1:5, 1:5]
     V1        id member_id loan_amnt funded_amnt
1 14192  46702803        -1     13050       13050
2 31933  91397132        -1     10000       10000
3 79661 125260410        -1      8650        8650
4 79515  54535222        -1      2700        2700
5 24356  12686393        -1     35000       35000

Each bootstrap sample is drawn with replacement from the original dataset, so some rows appear multiple times while others are left out entirely. The variation across these samples reveals how sensitive the model is to the composition of its training data.
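
A quick standalone illustration of the with-replacement mechanics (a toy sketch, independent of the lending data): drawing n row indices with replacement repeats some rows and leaves out roughly a third of the originals.

```r
# Toy example: row indices for one bootstrap draw of n rows
set.seed(42)
n <- 10000
idx <- sample(1:n, size = n, replace = TRUE)

# Share of original rows that never appear in this draw;
# theory says about (1 - 1/n)^n, i.e. ~36.8% for large n
mean(!(1:n %in% idx))
```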

Developing the Predictive Model Framework

This post uses logistic regression — the industry standard for credit risk modeling, valued for its interpretability and regulatory acceptance. The model includes typical credit variables: loan amount, income, and credit history metrics.

glm_model <- function(df){
  
  # Fit a simple model with a set specification
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = df)
  
  # Return predictions on the training data (log-odds scale)
  return(predict(mdl))
}

# Test the function
# Retrieve a data frame
train <- analysis(boot_sample$splits[[1]])

# Predict
pred <- glm_model(train)

# Check output
range(pred)  # Output is on log odds scale
[1] -27.884057   1.082999

The function returns predictions on the log-odds scale; these are transformed to a more intuitive credit score scale in a later step.
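
If probabilities are needed directly, the log odds can be mapped through the logistic function at any point (shown here on the `pred` vector from the test run above):

```r
# Convert log odds to probabilities of default via the logistic function
pd <- plogis(pred)

# Probabilities now lie in (0, 1)
range(pd)
```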

Iterative Model Training and Prediction Collection

# First, apply the glm fitting function to each of the bootstrap samples
# Note the use of lapply
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train)

  return(pred)
})

# Collate all predictions into a vector 
boot_preds <- do.call(c, output)
range(boot_preds)
[1] -132.89625    3.16478
# Get outliers
q_high <- quantile(boot_preds, 0.99)
q_low <- quantile(boot_preds, 0.01)

# Truncate the distribution to within the 1st and 99th percentiles,
# since extreme log odds cause issues later when scaling the output
boot_preds[boot_preds > q_high] <- q_high
boot_preds[boot_preds < q_low] <- q_low

range(boot_preds)
[1] -5.0670232 -0.2304028
# Convert to a data frame
boot_preds <- data.frame(pred = boot_preds, 
                         id = rep(1:length(boot_sample$splits), each = nrow(sample)))
head(boot_preds)
       pred id
1 -1.415344  1
2 -2.363616  1
3 -1.758686  1
4 -2.551845  1
5 -4.488183  1
6 -1.690804  1

Here, the logistic regression model is fit on each bootstrap sample and the resulting predictions are pooled. Values beyond the 1st and 99th percentiles are then truncated, a capping approach commonly used in production credit models to keep extreme outliers from distorting the score scaling.

Transforming Predictions to Credit Score Scale

Let’s convert the log-odds predictions into a familiar credit score format using the industry-standard Points to Double the Odds (PDO) methodology. With parameters typical of real-world credit systems (PDO = 30, anchor score = 700), scores are scaled so that higher values indicate lower credit risk.

scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){
  beta <- PDO / log(2)
  alpha <- Anchor - PDO * OddsAtAnchor
  
  # Simple linear scaling of the log odds
  scr <- alpha - beta * vec  
  
  # Round off
  return(round(scr, 0))
}

boot_preds$scores <- scaling_func(boot_preds$pred, 30, 2, 700)
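
As a quick sanity check of the scaling (using the same parameters as the call above): increasing the log odds of default by log(2), i.e. doubling the odds of default, should move the score down by exactly PDO points.

```r
# Doubling the odds of default should cost exactly PDO = 30 points
scaling_func(0, PDO = 30, OddsAtAnchor = 2, Anchor = 700) -
  scaling_func(log(2), PDO = 30, OddsAtAnchor = 2, Anchor = 700)
# [1] 30
```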

# Chart the distribution of predictions across all the samples
ggplot(boot_preds, aes(x = scores, color = factor(id))) + 
  geom_density() + 
  theme_minimal() + 
  theme(legend.position = "none") + 
  scale_color_grey() + 
  labs(title = "Predictions from bootstrapped samples", 
       subtitle = "Density function", 
       x = "Scores", 
       y = "Density")

Quantifying Prediction Uncertainty Across Score Ranges

Let’s measure prediction reliability directly by calculating the standard deviation of scores within each bin — a straightforward way to see where the model is stable and where it isn’t.

# Create bins using quantiles
breaks <- quantile(boot_preds$scores, probs = seq(0, 1, length.out = 20))
boot_preds$bins <- cut(boot_preds$scores, breaks = unique(breaks), include.lowest = TRUE, right = TRUE)

# Chart standard deviation of model predictions across each score bin
boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores)) %>%
  ggplot(aes(x = bins, y = std_dev)) +
  geom_col(color = "black", fill = "#90AACB") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  theme(legend.position = "none") + 
  labs(title = "Variability in model predictions across samples", 
       subtitle = "(measured using standard deviation)", 
       x = "Score Range", 
       y = "Standard Deviation")

As anticipated, the model’s predictions are most stable within a middle band of scores, while the lowest and highest buckets show markedly more variability.

This is the characteristic “U-shaped” pattern of prediction variability, a well-documented phenomenon in credit risk modeling: uncertainty peaks at the score extremes and bottoms out in the middle of the distribution (roughly 600 to 800 here). The implication for credit policy is direct: mid-range scores can be relied on, while decisions at the extremes should build in additional caution.
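
A complementary way to read the same result (a sketch reusing the `boot_preds` frame built above): summarise the 5th to 95th percentile spread of scores within each bin, which should widen toward the tail bins.

```r
library(dplyr)

# Per-bin central tendency and spread of scores across bootstrap runs
boot_preds %>%
  group_by(bins) %>%
  summarise(avg   = mean(scores),
            lower = quantile(scores, 0.05),
            upper = quantile(scores, 0.95),
            width = upper - lower)
```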

Practical Business Applications

These findings yield direct business applications:

  1. High Score Management: For extremely high scores, implement additional verification steps before automated approval
  2. Low Score Management: For very low scores, consider manual review procedures rather than automatic rejection

Advanced Methodology: Isolating Training Data Effects

Credit: Richard Warnung

For tighter analytical control, let’s try an alternative approach: train models on bootstrap samples, but evaluate them on a fixed validation set. This isolates the effect of training data variation on predictions.

# Note: assumes a fixed validation data frame `validate_set` exists in the session
Vs <- function(boot_split){
  # Train model on the bootstrapped data
  train <- analysis(boot_split)
  
  # Fit model
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)
  
  # Apply to a common validation set
  validate_preds <- predict(mdl, newdata = validate_set)
  
  # Return predictions
  return(validate_preds)
}

This approach gives clearer insight into how training data variation affects predictions — particularly useful when evaluating model updates in production.

# Create overall training and testing datasets 
id <- sample(1:nrow(sample), size = nrow(sample) * 0.8, replace = FALSE)

train_data <- sample[id,]
test_data <- sample[-id,]

# Bootstrapped samples are now pulled only from the overall training dataset
boot_sample <- bootstraps(data = train_data, times = 80)

# A modified version of the earlier function that predicts on a common test dataset
glm_model <- function(train, test){
  
  mdl <- glm(bad_flag ~
               loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
               inq_last_6mths + mths_since_last_delinq + fico_range_low +
               mths_since_last_record + revol_util + total_pymnt,
             family = "binomial",
             data = train)
  
  # Return fitted values on the test dataset
  return(predict(mdl, newdata = test))
}

# Apply the glm fitting function to each of the sample
# But predict on the same test dataset
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train, test_data)

  return(pred)
})
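
The collected predictions can then be summarised just as before. For instance (a sketch using the objects defined above, and reusing `scaling_func`), the spread of the score each test-set loan receives across the 80 bootstrap models:

```r
# One column per bootstrap model; every model scored the same test set
pred_matrix <- do.call(cbind, output)

# Per-loan standard deviation of the scaled score across models
# (additive terms in the scaling do not affect the sd)
loan_sd <- apply(scaling_func(pred_matrix), 1, sd)
summary(loan_sd)
```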

Key Takeaways

  • Prediction reliability varies significantly across score ranges, with highest uncertainty at distribution extremes
  • Bootstrap methodology provides a practical framework for measuring model uncertainty without additional data requirements
  • Risk management strategies should be adjusted based on score reliability, with enhanced verification for extreme scores
  • The PDO transformation enables intuitive score interpretation while preserving the underlying risk relationships

These techniques can be applied beyond credit scoring to any prediction problem where understanding model uncertainty is critical for decision-making.