Measuring model performance using a gains table

Credit risk series (Post #1)

Riddhiman

Modellers and analysts developing credit scores generally use something known as a gains table (or KS table) to measure and quantify the performance of such models. We’ll explore how to build such a table in this post.

The idea is to first discretise the population under consideration (say, the testing or validation set) into groups based on the model’s output (probability, log odds, or score), typically so that each group contains 10% of the total population (deciles). Summary statistics are then generated for each group, and the cumulative distributions of events and non-events are analysed to quantify the model’s performance.
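
As a minimal sketch of that idea (using a hypothetical data frame called scored with an outcome flag bad_flag and a model output score; the full worked example follows below), dplyr can form and summarise decile groups in a few lines:

library(dplyr)

# Hypothetical scored data frame: bad_flag (0/1 outcome) and score (model output)
scored %>%
  mutate(decile = ntile(score, 10)) %>%   # 10 roughly equal-sized groups
  group_by(decile) %>%
  summarise(total = n(),
            events = sum(bad_flag == 1),
            non_events = sum(bad_flag == 0))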

We’ll use dplyr in this post. Doing it this way keeps the code easy to read, and it would need minimal changes if you were to use sparklyr instead (say, in a big data environment where you need to run this directly on a Hadoop table).

Packages

Let’s get package installation out of the way first.

# Pacman is a package management tool 
install.packages("pacman")
library(pacman)

# p_load automatically installs packages if needed
p_load(dplyr, magrittr, knitr, scales)

Sample dataset

Here’s some sample data to play around with. The data set is a small sample of the Lending Club dataset available on Kaggle.

sample <- read.csv("credit_sample.csv")
dim(sample)
## [1] 10000   153
class(sample)
## [1] "data.frame"

Defining a target

First, we need to create a target (outcome) to model for. Since this is a credit risk use case, we are looking to create a target which identifies borrowers who defaulted on (or missed) their payments consecutively.

unique(sample$loan_status)
## [1] "Fully Paid"                                         
## [2] "Current"                                            
## [3] "Charged Off"                                        
## [4] "Late (31-120 days)"                                 
## [5] "Late (16-30 days)"                                  
## [6] "In Grace Period"                                    
## [7] "Does not meet the credit policy. Status:Fully Paid" 
## [8] "Does not meet the credit policy. Status:Charged Off"
# For simplicity we'll just use 
# 1. "Charged Off"
# 2. "Does not meet the credit policy. Status:Charged Off"
codes <- c("Charged Off", "Does not meet the credit policy. Status:Charged Off")

# For details on the %<>% operator please look at the 
# documentation for the magrittr package. 
sample %<>% mutate(bad_flag = ifelse(loan_status %in% codes, 1, 0))

Let’s also check the overall event rate.

sample %>% 
  summarise(events = sum(bad_flag == 1), 
            non_events = sum(bad_flag == 0)) %>% 
  mutate(event_rate = events/(events + non_events))
##   events non_events event_rate
## 1   1162       8838     0.1162

Building a model

Next, let’s build a quick and dirty model, the output of which we’ll use to build the gains table.

# Check out available features (not shown here for brevity)
colnames(sample)

We’ll need to do some data cleaning first.

# Replace all NA values with a default value
sample[is.na(sample)] <- -1

sample %<>% 
  
  # Remove cases where home ownership and payment plan are not reported
  filter(! home_ownership %in% c("", "NONE"),
         pymnt_plan != "") %>% 
  
  # Convert these two variables into factors
  mutate(home_ownership = factor(home_ownership), 
         pymnt_plan = factor(pymnt_plan))
# Train Test split
idx <- sample(1:nrow(sample), size = 0.7 * nrow(sample), replace = F)
train <- sample[idx,]
test <- sample[-idx,]
dim(train)
## [1] 6999  153
dim(test)
## [1] 3000  153
# Using a GLM model for simplicity
mdl <- glm(
  formula = bad_flag ~ 
    loan_amnt + term + mths_since_last_delinq + total_pymnt + 
    home_ownership + acc_now_delinq + 
    inq_last_6mths + delinq_amnt + 
    mths_since_last_record + mths_since_recent_revol_delinq + 
    mths_since_last_major_derog + mths_since_recent_inq + 
    mths_since_recent_bc + num_accts_ever_120_pd,
  family = "binomial", 
  data = train
)

While we are at it, let’s also attach the model predictions to the test dataset.

test$pred <- predict(mdl, newdata = test)
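
Note that predict() on a glm object returns the linear predictor (here, log odds) by default. If you would rather work with probabilities, you can ask for them explicitly; the binning below works either way since the transformation is monotonic.

# Optional: predicted probabilities instead of log odds
# (type = "response" applies the inverse link function)
test$prob <- predict(mdl, newdata = test, type = "response")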

Creating the Gains Table

The table has a few important components:

  • Bins in decreasing/increasing order of model output (probability/log odds/scores)
  • Population percentages contained in each bin
  • Observed event rates in each bin
  • Cumulative events and non events distributions
# Discretise predictions based on quantiles
q <- quantile(test$pred, probs = seq(0, 1, length.out = 11))

# Add bins to test dataset
test$bins <- cut(test$pred, breaks = q, include.lowest = T, right = T, ordered_result = T)
levels(test$bins)
##  [1] "[-5.97,-3.35]" "(-3.35,-2.9]"  "(-2.9,-2.64]"  "(-2.64,-2.44]"
##  [5] "(-2.44,-2.26]" "(-2.26,-2.1]"  "(-2.1,-1.9]"   "(-1.9,-1.65]" 
##  [9] "(-1.65,-1.28]" "(-1.28,0.638]"

Using the bins we created above, we can now start to put the table together.

# Start with the test dataset and summarise
gains_table <- test %>% 
  group_by(bins) %>% 
  summarise(total = n(), 
            events = sum(bad_flag == 1), 
            non_events = sum(bad_flag == 0))

At this point the table should look something like this:

kable(gains_table)
| bins          | total | events | non_events |
|---------------|-------|--------|------------|
| [-5.97,-3.35] |   300 |      2 |        298 |
| (-3.35,-2.9]  |   300 |     15 |        285 |
| (-2.9,-2.64]  |   300 |     14 |        286 |
| (-2.64,-2.44] |   300 |     19 |        281 |
| (-2.44,-2.26] |   300 |     24 |        276 |
| (-2.26,-2.1]  |   300 |     38 |        262 |
| (-2.1,-1.9]   |   300 |     52 |        248 |
| (-1.9,-1.65]  |   300 |     56 |        244 |
| (-1.65,-1.28] |   300 |     63 |        237 |
| (-1.28,0.638] |   300 |     84 |        216 |

Next, we’ll add the event rate column. Let’s also make the table presentable - I’ll use the percent() function from the scales package to show numbers as percentages.

gains_table %<>%
  mutate(event_rate = percent(events / total, 0.1, 100))

kable(gains_table)
| bins          | total | events | non_events | event_rate |
|---------------|-------|--------|------------|------------|
| [-5.97,-3.35] |   300 |      2 |        298 |       0.7% |
| (-3.35,-2.9]  |   300 |     15 |        285 |       5.0% |
| (-2.9,-2.64]  |   300 |     14 |        286 |       4.7% |
| (-2.64,-2.44] |   300 |     19 |        281 |       6.3% |
| (-2.44,-2.26] |   300 |     24 |        276 |       8.0% |
| (-2.26,-2.1]  |   300 |     38 |        262 |      12.7% |
| (-2.1,-1.9]   |   300 |     52 |        248 |      17.3% |
| (-1.9,-1.65]  |   300 |     56 |        244 |      18.7% |
| (-1.65,-1.28] |   300 |     63 |        237 |      21.0% |
| (-1.28,0.638] |   300 |     84 |        216 |      28.0% |

To this we’ll add some columns quantifying how events and non-events are distributed across the bins.

gains_table %<>%
  mutate(pop_pct = percent(total/sum(total), 0.1, 100), 
         
         # Not formatting these as percentages just yet
         c.events_pct = cumsum(events) / sum(events),
         c.non_events_pct = cumsum(non_events) / sum(non_events))

kable(gains_table) 
| bins          | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct |
|---------------|-------|--------|------------|------------|---------|--------------|------------------|
| [-5.97,-3.35] |   300 |      2 |        298 |       0.7% |   10.0% |    0.0054496 |        0.1131789 |
| (-3.35,-2.9]  |   300 |     15 |        285 |       5.0% |   10.0% |    0.0463215 |        0.2214204 |
| (-2.9,-2.64]  |   300 |     14 |        286 |       4.7% |   10.0% |    0.0844687 |        0.3300418 |
| (-2.64,-2.44] |   300 |     19 |        281 |       6.3% |   10.0% |    0.1362398 |        0.4367641 |
| (-2.44,-2.26] |   300 |     24 |        276 |       8.0% |   10.0% |    0.2016349 |        0.5415875 |
| (-2.26,-2.1]  |   300 |     38 |        262 |      12.7% |   10.0% |    0.3051771 |        0.6410938 |
| (-2.1,-1.9]   |   300 |     52 |        248 |      17.3% |   10.0% |    0.4468665 |        0.7352829 |
| (-1.9,-1.65]  |   300 |     56 |        244 |      18.7% |   10.0% |    0.5994550 |        0.8279529 |
| (-1.65,-1.28] |   300 |     63 |        237 |      21.0% |   10.0% |    0.7711172 |        0.9179643 |
| (-1.28,0.638] |   300 |     84 |        216 |      28.0% |   10.0% |    1.0000000 |        1.0000000 |

Almost done - we just need a few more columns, namely:

  • A column computing the absolute difference between the two cumulative distribution columns we created previously. The maximum value of this column is the primary performance metric, known as the KS statistic (we’ll cross-check it just after the table below).
  • Two additional columns computing event capture rates and cumulative event rates.
gains_table %<>%
  mutate(ks = round(abs(c.events_pct - c.non_events_pct), 2), 
         cap_rate = percent(cumsum(events)/sum(events), 1, 100), 
         c_event_rate = percent(cumsum(events)/cumsum(total), 0.1, 100), 
         
         # Format pending columns
         c.events_pct = percent(c.events_pct, 0.1, 100),
         c.non_events_pct = percent(c.non_events_pct, 0.1, 100))

kable(gains_table)
| bins          | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct |   ks | cap_rate | c_event_rate |
|---------------|-------|--------|------------|------------|---------|--------------|------------------|------|----------|--------------|
| [-5.97,-3.35] |   300 |      2 |        298 |       0.7% |   10.0% |         0.5% |            11.3% | 0.11 |       1% |         0.7% |
| (-3.35,-2.9]  |   300 |     15 |        285 |       5.0% |   10.0% |         4.6% |            22.1% | 0.18 |       5% |         2.8% |
| (-2.9,-2.64]  |   300 |     14 |        286 |       4.7% |   10.0% |         8.4% |            33.0% | 0.25 |       8% |         3.4% |
| (-2.64,-2.44] |   300 |     19 |        281 |       6.3% |   10.0% |        13.6% |            43.7% | 0.30 |      14% |         4.2% |
| (-2.44,-2.26] |   300 |     24 |        276 |       8.0% |   10.0% |        20.2% |            54.2% | 0.34 |      20% |         4.9% |
| (-2.26,-2.1]  |   300 |     38 |        262 |      12.7% |   10.0% |        30.5% |            64.1% | 0.34 |      31% |         6.2% |
| (-2.1,-1.9]   |   300 |     52 |        248 |      17.3% |   10.0% |        44.7% |            73.5% | 0.29 |      45% |         7.8% |
| (-1.9,-1.65]  |   300 |     56 |        244 |      18.7% |   10.0% |        59.9% |            82.8% | 0.23 |      60% |         9.2% |
| (-1.65,-1.28] |   300 |     63 |        237 |      21.0% |   10.0% |        77.1% |            91.8% | 0.15 |      77% |        10.5% |
| (-1.28,0.638] |   300 |     84 |        216 |      28.0% |   10.0% |       100.0% |           100.0% | 0.00 |     100% |        12.2% |
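
The maximum of the ks column above (0.34 here) is the binned KS statistic. As a quick sanity check, it should be close to (and never larger than) the unbinned two-sample KS statistic computed directly from the predictions, which can be obtained with base R’s ks.test():

# Unbinned two-sample KS statistic as a cross-check of the binned value above
ks.test(test$pred[test$bad_flag == 1],
        test$pred[test$bad_flag == 0])$statistic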

Creating a function

Finally, we can encapsulate all of the above code in a single function. Note that we actually do not need the full test/train dataset, just the actual classes and predicted outcomes (log odds/probability/score).

gains_table <- function(act, pred, increasing = T, nBins = 10){
  
  q <- quantile(pred, probs = seq(0, 1, length.out = nBins + 1))
  bins <- cut(pred, breaks = q, include.lowest = T, right = T, ordered_result = T)
  
  df <- data.frame(act, pred, bins)
  
  df %>% 
    
    group_by(bins) %>% 
    summarise(total = n(), 
              events = sum(act == 1), 
              non_events = sum(act == 0)) %>% 
    mutate(event_rate = percent(events / total, 0.1, 100)) %>% 
    
    # This odd looking format is to ensure that the if-else 
    # condition is part of the dplyr chain
    {if(increasing == TRUE){
      arrange(., bins)
    }else{
      arrange(., desc(bins))
    }} %>% 
    
    mutate(pop_pct = percent(total/sum(total), 0.1, 100), 
           c.events_pct = cumsum(events) / sum(events),
           c.non_events_pct = cumsum(non_events) / sum(non_events), 
           ks = round(abs(c.events_pct - c.non_events_pct), 2), 
           cap_rate = percent(cumsum(events)/sum(events), 1, 100), 
           c_event_rate = percent(cumsum(events)/cumsum(total), 0.1, 100), 
           c.events_pct = percent(c.events_pct, 0.1, 100),
           c.non_events_pct = percent(c.non_events_pct, 0.1, 100))
}

It is worth noting that since the capture rate is computed from top to bottom, it is important that the table is arranged appropriately. That is, when modelling for bads, the table should be arranged in descending order of the model output (i.e. higher event rates at the top), and vice versa.

Also, if you are planning on using this with sparklyr, consider looking into the ft_quantile_discretizer() function. It would replace the quantile() and cut() steps here.
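
For reference, here is a rough, untested sketch of what the binning and summarising step might look like in sparklyr. It assumes an existing Spark connection and a Spark DataFrame scored_tbl holding bad_flag and pred columns (both hypothetical names); the cumulative calculations would then follow the same pattern as above.

library(sparklyr)
library(dplyr)

# Rough sketch only: ft_quantile_discretizer() takes over the job of
# quantile() + cut(), producing a numeric bin index column
scored_tbl %>%
  ft_quantile_discretizer(input_col = "pred", output_col = "bins",
                          num_buckets = 10) %>%
  group_by(bins) %>%
  summarise(total = n(),
            events = sum(as.numeric(bad_flag == 1), na.rm = TRUE),
            non_events = sum(as.numeric(bad_flag == 0), na.rm = TRUE))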

# Test the function
tab <- gains_table(test$bad_flag, test$pred, F, 10)
kable(tab)
| bins          | total | events | non_events | event_rate | pop_pct | c.events_pct | c.non_events_pct |   ks | cap_rate | c_event_rate |
|---------------|-------|--------|------------|------------|---------|--------------|------------------|------|----------|--------------|
| (-1.28,0.638] |   300 |     84 |        216 |      28.0% |   10.0% |        22.9% |             8.2% | 0.15 |      23% |        28.0% |
| (-1.65,-1.28] |   300 |     63 |        237 |      21.0% |   10.0% |        40.1% |            17.2% | 0.23 |      40% |        24.5% |
| (-1.9,-1.65]  |   300 |     56 |        244 |      18.7% |   10.0% |        55.3% |            26.5% | 0.29 |      55% |        22.6% |
| (-2.1,-1.9]   |   300 |     52 |        248 |      17.3% |   10.0% |        69.5% |            35.9% | 0.34 |      69% |        21.2% |
| (-2.26,-2.1]  |   300 |     38 |        262 |      12.7% |   10.0% |        79.8% |            45.8% | 0.34 |      80% |        19.5% |
| (-2.44,-2.26] |   300 |     24 |        276 |       8.0% |   10.0% |        86.4% |            56.3% | 0.30 |      86% |        17.6% |
| (-2.64,-2.44] |   300 |     19 |        281 |       6.3% |   10.0% |        91.6% |            67.0% | 0.25 |      92% |        16.0% |
| (-2.9,-2.64]  |   300 |     14 |        286 |       4.7% |   10.0% |        95.4% |            77.9% | 0.18 |      95% |        14.6% |
| (-3.35,-2.9]  |   300 |     15 |        285 |       5.0% |   10.0% |        99.5% |            88.7% | 0.11 |      99% |        13.5% |
| [-5.97,-3.35] |   300 |      2 |        298 |       0.7% |   10.0% |       100.0% |           100.0% | 0.00 |     100% |        12.2% |

Interpretation

Some notes on how to interpret such a table:

  • Since scoring models aim to risk-rank borrowers, the first thing to look for is whether the event rates increase (or decrease) consistently across bins. If they do not, one might not be able to confidently conclude from the model’s output whether borrower A is better (or worse) than borrower B.
  • If bin sizes are not consistent (in this case ~10%) it would imply that the model is assigning the same output/score to a lot of borrowers (clumping). This could pose issues later on (say when deciding cutoffs). The ideal solution is to add additional variables that can help differentiate between good and bad borrowers.
  • While the ideal cutoff would be the bin where the KS statistic is at its maximum, additional aspects like capture rates and approval rates should also be taken into account (see the snippet after this list).
  • Typically, analysts look for a model which achieves its maximum KS value within the first two or three deciles. That way, when creating underwriting policies, you would only end up rejecting 20%-30% of the applicant pool.
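
For example, using the table we just built, the peak KS value and the bin where it occurs (a candidate cutoff, subject to the capture rate and approval rate considerations above) can be pulled out directly:

# Peak KS and the bin at which it occurs
max(tab$ks)
tab$bins[which.max(tab$ks)]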

Thoughts? Comments? Helpful? Not helpful? Like to see anything else added in here? Let me know!