KS Statistic: A Simple, Clear Explanation
What Problem Does KS Solve?
Imagine you're a bank. You built a model that gives every loan applicant a risk score (0 to 1). Higher score = more likely to default.
Now the big question:
"How well does my model separate the bad guys (defaulters) from the good guys (non-defaulters)?"
The KS (Kolmogorov-Smirnov) statistic answers exactly this. It finds the score threshold at which the gap between the share of defaulters captured and the share of non-defaulters captured is widest.
The Core Idea (No Math Yet)
Think of two queues:
🔴 Queue A: All defaulters, lined up by their model score (highest first)
🟢 Queue B: All non-defaulters, lined up the same way
Now you slowly lower a threshold from 1.0 → 0.0 and at each step ask:
"What % of Queue A have I captured so far?"
vs
"What % of Queue B have I captured so far?"
If the model is good, you'll capture defaulters much faster than non-defaulters.
The biggest gap between these two percentages = KS Statistic.
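For readers who do want the math, the same idea can be written as a formula. The notation here is an assumption made for this sketch, not something the walk-through depends on: $t$ is the threshold, $F_D(t)$ is the fraction of defaulters with score at or above $t$, and $F_N(t)$ is the same fraction for non-defaulters.

```latex
\[
F_D(t) = \frac{\#\{\text{defaulters with score} \ge t\}}{\#\{\text{defaulters}\}},
\qquad
F_N(t) = \frac{\#\{\text{non-defaulters with score} \ge t\}}{\#\{\text{non-defaulters}\}}
\]
\[
\mathrm{KS} = \max_{t} \, \bigl| F_D(t) - F_N(t) \bigr|
\]
```

A model that ranks every defaulter above every non-defaulter reaches $\mathrm{KS} = 1$; a model whose scores carry no information about default gives a KS near 0.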
Worked Example — 10 Loan Applicants
Here are 10 people, their actual outcome, and the model's predicted score:
| Person | Actually Defaulted? | Model Score |
|---|---|---|
| A | ✅ Yes | 0.95 |
| B | ❌ No | 0.90 |
| C | ✅ Yes | 0.85 |
| D | ❌ No | 0.80 |
| E | ❌ No | 0.70 |
| F | ✅ Yes | 0.60 |
| G | ❌ No | 0.40 |
| H | ❌ No | 0.30 |
| I | ✅ Yes | 0.20 |
| J | ❌ No | 0.10 |
Totals
🔴 Defaulters = 4
🟢 Non-defaulters = 6
Data is already sorted by score (highest first). Now we walk top to bottom.
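Walk-Through: Row by Row
At each row, we track two running counters:
🔴 Defaulters captured so far → out of 4 total
🟢 Non-defaulters captured so far → out of 6 total
The Gap column is the absolute difference between the two percentages, so it stays positive even in row 8, where the non-defaulter percentage is the larger one.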
| Row | Person | Defaulted? | Score | Defaulters Captured | Non-Defaulters Captured | Gap |
|---|---|---|---|---|---|---|
| 1 | A | ✅ Yes | 0.95 | 1 out of 4 = 25.0% | 0 out of 6 = 0.0% | 25.0% |
| 2 | B | ❌ No | 0.90 | 1 out of 4 = 25.0% | 1 out of 6 = 16.7% | 8.3% |
| 3 | C | ✅ Yes | 0.85 | 2 out of 4 = 50.0% | 1 out of 6 = 16.7% | 33.3% ⭐ |
| 4 | D | ❌ No | 0.80 | 2 out of 4 = 50.0% | 2 out of 6 = 33.3% | 16.7% |
| 5 | E | ❌ No | 0.70 | 2 out of 4 = 50.0% | 3 out of 6 = 50.0% | 0.0% |
| 6 | F | ✅ Yes | 0.60 | 3 out of 4 = 75.0% | 3 out of 6 = 50.0% | 25.0% |
| 7 | G | ❌ No | 0.40 | 3 out of 4 = 75.0% | 4 out of 6 = 66.7% | 8.3% |
| 8 | H | ❌ No | 0.30 | 3 out of 4 = 75.0% | 5 out of 6 = 83.3% | 8.3% |
| 9 | I | ✅ Yes | 0.20 | 4 out of 4 = 100.0% | 5 out of 6 = 83.3% | 16.7% |
| 10 | J | ❌ No | 0.10 | 4 out of 4 = 100.0% | 6 out of 6 = 100.0% | 0.0% |
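The walk-through above can be reproduced in a few lines of Python. This is a minimal sketch rather than a reference implementation: the names `applicants` and `ks_walk` are made up for illustration, and the code assumes every score is distinct and that both groups are non-empty.

```python
# Applicants from the table above: (person, defaulted?, model score)
applicants = [
    ("A", True, 0.95), ("B", False, 0.90), ("C", True, 0.85),
    ("D", False, 0.80), ("E", False, 0.70), ("F", True, 0.60),
    ("G", False, 0.40), ("H", False, 0.30), ("I", True, 0.20),
    ("J", False, 0.10),
]


def ks_walk(rows):
    """Walk rows from highest to lowest score; return one record per row.

    Assumes distinct scores and at least one row in each group.
    """
    rows = sorted(rows, key=lambda r: r[2], reverse=True)
    total_bad = sum(1 for _, bad, _ in rows if bad)
    total_good = len(rows) - total_bad
    bad_seen = good_seen = 0
    records = []
    for person, bad, score in rows:
        if bad:
            bad_seen += 1
        else:
            good_seen += 1
        bad_pct = bad_seen / total_bad      # share of defaulters captured
        good_pct = good_seen / total_good   # share of non-defaulters captured
        gap = abs(bad_pct - good_pct)
        records.append((person, score, bad_pct, good_pct, gap))
    return records


records = ks_walk(applicants)
for i, (person, score, bad_pct, good_pct, gap) in enumerate(records, start=1):
    print(f"{i:>2} {person} {score:.2f} {bad_pct:6.1%} {good_pct:6.1%} {gap:6.1%}")

# The KS statistic is the largest gap seen during the walk.
best = max(records, key=lambda r: r[4])
print(f"KS = {best[4]:.3f} at score {best[1]:.2f}")
```

The printed rows should match the table column for column, and the last line reports the widest gap and the score at which it occurs.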
Result
KS = 33.3% (0.333) at Row 3 (score = 0.85)
At that point, the model has already captured half of all defaulters but only 16.7% of non-defaulters.
That's the best separation it achieves.
Why Does Row 3 Matter?
Think of it practically.
If you said:
"Reject everyone with score ≥ 0.85"
Then:
✅ You'd catch 50% of the people who would actually default
✅ You'd only wrongly reject 16.7% of the good customers
🎯 That's the sweet spot where the model discriminates best
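The two percentages for the "reject at 0.85" rule can be checked directly. The sketch below is only illustrative: `rates_at_threshold` is a hypothetical helper name, and the two lists simply restate the ten applicants from the worked example.

```python
def rates_at_threshold(scores, defaulted, threshold):
    """Share of defaulters and of non-defaulters with score >= threshold."""
    bad = [s for s, d in zip(scores, defaulted) if d]
    good = [s for s, d in zip(scores, defaulted) if not d]
    caught = sum(s >= threshold for s in bad) / len(bad)
    wrongly_rejected = sum(s >= threshold for s in good) / len(good)
    return caught, wrongly_rejected


scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
defaulted = [True, False, True, False, False, True, False, False, True, False]

caught, wrongly_rejected = rates_at_threshold(scores, defaulted, 0.85)
print(f"Caught {caught:.1%} of defaulters")
print(f"Wrongly rejected {wrongly_rejected:.1%} of good customers")
print(f"Gap at this threshold: {caught - wrongly_rejected:.3f}")
```

Trying other thresholds (say 0.60 or 0.80) shows that the gap shrinks on either side of 0.85, which is what makes that row the KS point.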
How Good is Your KS?
| KS Value | Verdict | What It Means |
|---|---|---|
| < 0.20 | ❌ Poor | Model barely separates the two groups |
| 0.20 – 0.30 | ⚠️ Fair | Some separation, needs improvement |
| 0.30 – 0.40 | ✅ Good | Decent discrimination |
| > 0.40 | Excellent | Strong separation |
Our Example
KS = 0.333 → ✅ Good Model
Nice separation between defaulters and non-defaulters.
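If you want to apply the bands above in code, a small lookup is enough. This is a sketch under one stated assumption: the table lists 0.30 in both the Fair and the Good rows, so the hypothetical helper `ks_verdict` below treats a value exactly on a shared boundary as belonging to the higher band.

```python
def ks_verdict(ks):
    """Map a KS value to the rule-of-thumb bands from the table above.

    Assumption: a value sitting exactly on a shared boundary (0.30)
    goes to the higher band. 0.40 stays "Good" because "Excellent"
    is defined as strictly greater than 0.40.
    """
    if not 0.0 <= ks <= 1.0:
        raise ValueError("KS must lie between 0 and 1")
    if ks < 0.20:
        return "Poor"
    if ks < 0.30:
        return "Fair"
    if ks <= 0.40:
        return "Good"
    return "Excellent"


print(ks_verdict(0.333))
```

For the worked example this returns the same verdict as the table. Keep in mind that the bands are rules of thumb, and a KS computed from only 10 applicants is a rough estimate.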
⚠️ Common Misconceptions
| Misconception | Reality |
|---|---|
| "KS point = the best threshold to use in production" | Not always. Business costs (e.g., cost of missing a defaulter vs. rejecting a good customer) should also drive threshold choice. |
| "Higher KS is always better" | A KS close to 1.0 on training data likely means overfitting. Be suspicious, and compare it with the KS on data the model has not seen. |
| "KS works for all problems" | KS is best suited for binary classification with continuous scores. For multi-class or ranking problems, other metrics are more appropriate. |
One-Line Summary
KS Statistic = the maximum gap between the cumulative % of positives and cumulative % of negatives, as you scan predictions from highest to lowest. It tells you how well your model separates the two classes.
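The summary translates into a short general-purpose function. The version below is a sketch, not a library API: `ks_statistic` is a name chosen for this example. It differs from a plain row-by-row walk in one way: applicants who share the same score are processed together, because no threshold can separate them, so the gap is only measured at thresholds that can actually be applied.

```python
from itertools import groupby


def ks_statistic(scores, labels):
    """Return (KS, score at which the widest gap occurs).

    labels: truthy = positive (e.g. defaulted), falsy = negative.
    Rows with equal scores are handled as one group.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Scan predictions from highest score to lowest.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    pos_seen = neg_seen = 0
    best_gap, best_score = 0.0, None
    for score, group in groupby(pairs, key=lambda p: p[0]):
        for _, y in group:
            if y:
                pos_seen += 1
            else:
                neg_seen += 1
        gap = abs(pos_seen / n_pos - neg_seen / n_neg)
        if gap > best_gap:
            best_gap, best_score = gap, score
    return best_gap, best_score


scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
ks, at_score = ks_statistic(scores, labels)
print(f"KS = {ks:.3f} at score {at_score}")
```

On the ten applicants from the worked example this gives the same KS and the same score as the table. With tied scores, for example `ks_statistic([0.9, 0.9, 0.1, 0.1], [1, 0, 1, 0])`, it returns a KS of 0, whereas counting the gap after each individual row would wrongly report 0.5.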