Tutorial: Assessing Credit Score Prediction Reliability Using Bootstrap Resampling
R
Credit Risk Analytics
Bootstrapping
Published
November 15, 2024
Introduction
Credit scoring models demonstrate optimal performance within the central regions of the score distribution, yet exhibit diminished reliability at the distribution extremes where data becomes sparse. This tutorial provides a comprehensive methodology for employing bootstrap resampling techniques to quantify prediction variability across different score ranges, thereby enabling practitioners to identify regions where their models demonstrate the highest degree of dependability.
Reduced sample sizes inherently result in increased variance in statistical estimates, particularly for extreme values within a distribution. While central tendency measures such as means demonstrate relative stability when estimated from limited data, tail percentiles (95th, 99th) exhibit substantially greater volatility. This phenomenon is of critical importance in credit scoring applications, where extremely high and low scores correspond to these unstable tail regions of the distribution.
# Size of each sample drawn from the distribution
n_samples <- 100

# Number of times the sampling is repeated
repeats <- 1000

# Mean and standard deviation of the normal distribution
mu <- 10
std_dev <- 1

# Draw all the values at once
samples <- rnorm(n_samples * repeats, mean = mu, sd = std_dev)

# Arrange into a matrix-like object with one sample per row,
# i.e. `repeats' rows and `n_samples' columns
samples <- matrix(samples, nrow = repeats, ncol = n_samples)

# Compute the mean of each sample
sample_means <- apply(samples, 1, mean)

# Similarly, compute the 75%, 95% and 99% quantiles of each sample
sample_75_quantile <- apply(samples, 1, quantile, p = 0.75)
sample_95_quantile <- apply(samples, 1, quantile, p = 0.95)
sample_99_quantile <- apply(samples, 1, quantile, p = 0.99)

# Compare coefficients of variation
sd(sample_means) / mean(sample_means)
[1] 0.0100664
sd(sample_75_quantile)/mean(sample_75_quantile)
[1] 0.01270263
sd(sample_95_quantile)/mean(sample_95_quantile)
[1] 0.01888391
# Plot the distributions
combined_vec <- c(sample_means, sample_75_quantile, sample_95_quantile, sample_99_quantile)

plot(density(sample_means),
     col = "#6F69AC", lwd = 3,
     main = "Estimating the mean vs tail quantiles",
     xlab = "",
     xlim = c(min(combined_vec), max(combined_vec)))
lines(density(sample_75_quantile), col = "#95DAC1", lwd = 3)
lines(density(sample_95_quantile), col = "#FFEBA1", lwd = 3)
lines(density(sample_99_quantile), col = "#FD6F96", lwd = 3)
grid()
legend("topright",
       fill = c("#6F69AC", "#95DAC1", "#FFEBA1", "#FD6F96"),
       legend = c("Mean", "75% Quantile", "95% Quantile", "99% Quantile"),
       cex = 0.7)
The visualization demonstrates the substantial increase in uncertainty when estimating extreme values compared to central tendencies. The distribution corresponding to the mean (purple) exhibits considerably narrower dispersion than that of the 99th percentile (pink). This statistical principle directly applies to credit scoring applications, where scores at the extremes of the distribution inherently possess greater uncertainty.
In this tutorial, we will utilize a sample from the Lending Club dataset. Loans classified as “Charged Off” will be designated as defaults. The observed class imbalance represents a typical characteristic of credit portfolios and contributes significantly to prediction challenges at distribution extremes.
# Libraries used throughout this tutorial
library(dplyr)     # data manipulation (mutate, group_by, summarise)
library(magrittr)  # the %<>% compound assignment pipe
library(rsample)   # bootstraps() and analysis()
library(ggplot2)   # charts

# Load sample data (sample of the lending club data)
sample <- read.csv("http://bit.ly/42ypcnJ")

# Mark which loan statuses will be tagged as default
codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off")

# Apply the above codes and create the target variable
sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

# Replace missing values with a default value
sample[is.na(sample)] <- -1

# Get a summary tally of the target
table(sample$bad_flag)
0 1
8838 1162
Implementing Bootstrap Resampling Strategy
This section demonstrates the creation of 100 bootstrap samples to quantify the variation in model predictions across different score ranges. Bootstrap resampling is a statistical technique that generates multiple simulated datasets to measure prediction uncertainty without requiring additional data collection.
# Create 100 bootstrap samples
boot_sample <- bootstraps(data = sample, times = 100)
head(boot_sample, 3)
Each bootstrap sample consists of random draws with replacement from the original dataset, creating controlled variations that effectively reveal model sensitivity to different data compositions.
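To make this concrete, the split objects created by rsample can be inspected directly. The short check below is not part of the original workflow; it simply confirms that an individual bootstrap sample has the same number of rows as the source data and, because of sampling with replacement, contains duplicated rows.

# Pull the analysis (bootstrap) portion of the first split
first_split <- analysis(boot_sample$splits[[1]])

# Same number of rows as the original data
nrow(first_split) == nrow(sample)

# Sampling with replacement means some rows appear more than once
sum(duplicated(first_split)) > 0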
Developing the Predictive Model Framework
This tutorial employs logistic regression as the predictive modeling technique. Logistic regression represents the industry standard for credit risk modeling due to its interpretability and regulatory acceptance. The model specification incorporates typical credit variables including loan amount, income, and credit history metrics.
glm_model <- function(df){
  # Fit a simple logistic regression with a fixed specification
  mdl <- glm(
    bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
      inq_last_6mths + mths_since_last_delinq + fico_range_low +
      mths_since_last_record + revol_util + total_pymnt,
    family = "binomial",
    data = df)

  # Return fitted values
  return(predict(mdl))
}

# Test the function
# Retrieve a data frame from the first bootstrap split
train <- analysis(boot_sample$splits[[1]])

# Predict
pred <- glm_model(train)

# Check the output
range(pred)  # Output is on the log-odds scale
[1] -30.25449 1.57891
The function returns predictions in log-odds format, which will subsequently be transformed to a more intuitive credit score scale in later steps.
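As a side note, if default probabilities rather than scores are needed, the log-odds returned by predict() can be mapped back with the inverse logit. A minimal illustration using base R's plogis():

# Convert log odds to probabilities of default (inverse logit)
head(plogis(pred))

# For instance, a log-odds value of -3 corresponds to a default probability of roughly 4.7%
plogis(-3)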
Iterative Model Training and Prediction Collection
# First apply the glm fitting function to each of the bootstrap samples
# Note the use of lapply
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train)
  return(pred)
})

# Collate all predictions into a single vector
boot_preds <- do.call(c, output)
range(boot_preds)
[1] -130.406408 4.125475
# Get outlier thresholds
q_high <- quantile(boot_preds, 0.99)
q_low <- quantile(boot_preds, 0.01)

# Truncate the overall distribution to within the lower 1% and upper 1% quantiles
# Doing this since extreme values create issues later on when scaling the output
boot_preds[boot_preds > q_high] <- q_high
boot_preds[boot_preds < q_low] <- q_low
range(boot_preds)
[1] -5.0603991 -0.2283903
# Convert to a data frame, tagging each prediction with its bootstrap sample id
boot_preds <- data.frame(
  pred = boot_preds,
  id = rep(1:length(boot_sample$splits), each = nrow(sample)))
head(boot_preds)
pred id
1 -1.460522 1
2 -1.644074 1
3 -3.123883 1
4 -4.236837 1
5 -3.197778 1
6 -3.123942 1
In this step, we applied the logistic regression model to each bootstrap sample and collected the resulting predictions. We then truncated extreme values (beyond the 1st and 99th percentiles) to remove outliers, a procedure analogous to the capping commonly applied in production credit models.
Transforming Predictions to Credit Score Scale
This section demonstrates the conversion of log-odds predictions to a recognizable credit score format using the industry-standard Points to Double Odds (PDO) methodology. By employing parameters consistent with real-world credit systems (PDO=30, Anchor=700), we transform the model predictions into intuitive scores where higher numerical values indicate lower credit risk.
scaling_func <- function(vec, PDO = 30, OddsAtAnchor = 5, Anchor = 700){
  beta <- PDO / log(2)
  alpha <- Anchor - PDO * OddsAtAnchor

  # Simple linear scaling of the log odds
  scr <- alpha - beta * vec

  # Round off
  return(round(scr, 0))
}

boot_preds$scores <- scaling_func(boot_preds$pred, 30, 2, 700)

# Chart the distribution of scores across all the samples
ggplot(boot_preds, aes(x = scores, color = factor(id))) +
  geom_density() +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_grey() +
  labs(title = "Predictions from bootstrapped samples",
       subtitle = "Density function",
       x = "Scaled scores",
       y = "Density")
Quantifying Prediction Uncertainty Across Score Ranges
This section provides a methodology to directly measure prediction reliability variation across different score ranges through the calculation of standard deviation within each score bin. This approach enables precise quantification of uncertainty at different score levels.
# Create bins using quantiles
breaks <- quantile(boot_preds$scores, probs = seq(0, 1, length.out = 20))
boot_preds$bins <- cut(boot_preds$scores, breaks = unique(breaks),
                       include.lowest = TRUE, right = TRUE)

# Chart the standard deviation of the scores within each score bin
boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores)) %>%
  ggplot(aes(x = bins, y = std_dev)) +
  geom_col(color = "black", fill = "#90AACB") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  labs(title = "Variability in model predictions across samples",
       subtitle = "(measured using standard deviation)",
       x = "Score Range",
       y = "Standard Deviation")
As anticipated, the model’s predictions demonstrate enhanced reliability within a specific range of values (700-800), while exhibiting significant variability in the lowest and highest score buckets.
The visualization reveals the characteristic “U-shaped” pattern of prediction variability that is well documented in credit risk modeling: uncertainty is highest in the extreme score ranges and lowest in the middle of the distribution (roughly 600-800), confirming the initial hypothesis. This has direct implications for credit policy: scores in the middle range can be relied upon most heavily, while decisions at the extremes should incorporate additional caution because of the elevated uncertainty.
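One way to turn this chart into something actionable is to flag score bins whose variability exceeds a threshold. The sketch below reuses the per-bin standard deviations computed above; the threshold (1.5 times the median bin-level standard deviation) is an arbitrary choice for illustration, not a recommendation.

# Per-bin standard deviation as a table rather than a chart
bin_stability <- boot_preds %>%
  group_by(bins) %>%
  summarise(std_dev = sd(scores), .groups = "drop")

# Flag bins whose variability exceeds an illustrative threshold
sd_threshold <- 1.5 * median(bin_stability$std_dev)
bin_stability$reliable <- bin_stability$std_dev <= sd_threshold
bin_stability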
Practical Business Applications
These findings yield direct business applications (a simple routing sketch follows this list):
High Score Management: For extremely high scores, implement additional verification steps before automated approval
Low Score Management: For very low scores, consider manual review procedures rather than automatic rejection
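These two rules could be encoded as a simple routing function. The band boundaries used here (600 and 800) are taken loosely from the chart above and are purely illustrative; in practice they would come from the variability analysis and the lender's risk appetite.

# Hypothetical routing rule based on score reliability bands (illustrative only)
route_application <- function(score, reliable_low = 600, reliable_high = 800) {
  if (score > reliable_high) {
    "approve_after_additional_verification"  # high score, but high uncertainty
  } else if (score < reliable_low) {
    "manual_review"                          # low score, high uncertainty
  } else {
    "automated_decision"                     # stable middle range
  }
}

route_application(820)
route_application(710)
route_application(540)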
Advanced Methodology: Isolating Training Data Effects
Credit: Richard Warnung
For enhanced analytical control, this section presents an alternative approach where models are trained on bootstrap samples while being evaluated on a consistent validation set. This methodology effectively isolates the impact of training data variation on model predictions.
Vs <- function(boot_split){
  # Train the model on the bootstrapped data
  train <- analysis(boot_split)

  # Fit the model
  mdl <- glm(
    bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
      inq_last_6mths + mths_since_last_delinq + fico_range_low +
      mths_since_last_record + revol_util + total_pymnt,
    family = "binomial",
    data = train)

  # Apply to a common validation set
  # (validate_set is assumed to be a validation data frame created beforehand)
  validate_preds <- predict(mdl, newdata = validate_set)

  # Return predictions
  return(validate_preds)
}
This methodology provides enhanced insight into how variations in training data composition affect model predictions, which proves valuable when evaluating model updates in production environments.
# Create overall training and testing datasets
id <- sample(1:nrow(sample), size = nrow(sample) * 0.8, replace = FALSE)
train_data <- sample[id, ]
test_data <- sample[-id, ]

# Bootstrapped samples are now drawn only from the overall training dataset
boot_sample <- bootstraps(data = train_data, times = 80)

# The same function as before, but now predicting on a common test dataset
glm_model <- function(train, test){
  mdl <- glm(
    bad_flag ~ loan_amnt + funded_amnt + annual_inc + delinq_2yrs +
      inq_last_6mths + mths_since_last_delinq + fico_range_low +
      mths_since_last_record + revol_util + total_pymnt,
    family = "binomial",
    data = train)

  # Return fitted values on the test dataset
  return(predict(mdl, newdata = test))
}

# Apply the glm fitting function to each of the bootstrap samples,
# but predict on the same test dataset
output <- lapply(boot_sample$splits, function(x){
  train <- analysis(x)
  pred <- glm_model(train, test_data)
  return(pred)
})
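The list returned above can then be analysed in the same spirit as before. The sketch below is not part of the original code; it collates the validation-set predictions into a matrix with one column per bootstrap model and measures how much each loan's prediction moves across models.

# One column per bootstrap model, one row per loan in the test set
val_preds <- do.call(cbind, output)

# Per-loan standard deviation of the predicted log odds across models
pred_sd <- apply(val_preds, 1, sd)
summary(pred_sd)

# Loans with the most unstable predictions are candidates for manual review
head(order(pred_sd, decreasing = TRUE), 10)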
Key Takeaways
Prediction reliability varies significantly across score ranges, with highest uncertainty at distribution extremes
Bootstrap methodology provides a practical framework for measuring model uncertainty without additional data requirements
Risk management strategies should be adjusted based on score reliability, with enhanced verification for extreme scores
The PDO transformation enables intuitive score interpretation while preserving the underlying risk relationships
These techniques can be applied beyond credit scoring to any prediction problem where understanding model uncertainty is critical for decision-making.