
Sunday, August 10, 2025

How R² is different from RMSE and MAE

Let’s unpack this step by step, because R², RMSE, and MAE all involve the "difference between actual and predicted," but they measure different things and answer different questions.


1. R² (Coefficient of Determination): Variance Captured

  • Think of your target values (y_actual) as having some spread (variance) around their mean.

  • If you didn’t have a model and just guessed the mean for everyone, that’s your baseline.

  • R² asks:

    "How much better is my model compared to just guessing the mean every time?"


Formula

R² = 1 − (Sum of Squared Errors of Model) / (Sum of Squared Errors of Mean Model)

Where:

  • SSE_model = Σ(Actual − Predicted)²

  • SSE_mean = Σ(Actual − Mean)²


Intuition

  • R² = 1.0 → Model perfectly predicts all values (100% of variance explained).

  • R² = 0.0 → Model is no better than guessing the mean.

  • R² < 0.0 → Model is worse than guessing the mean (ouch).


Example in Credit Limit Prediction

Let’s say actual limits for 5 customers are:

Actual:  10k, 12k, 15k, 20k, 25k
Mean:    16.4k

Variance is the spread around 16.4k.

Case A: Terrible Model

Predicted: 16.4k for everyone (mean model) →
SSE_model = SSE_mean → R² = 0.

Case B: Decent Model

Predicted: 9k, 13k, 14k, 21k, 26k
SSE_model is much smaller than SSE_mean → R² ≈ 0.97.
This means the model explains about 97% of the variation in limits between customers.
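A minimal sketch of the same calculation with NumPy (the arrays are the hypothetical limits from this example):

```python
import numpy as np

# Hypothetical credit limits (in dollars) from the example above
actual    = np.array([10_000, 12_000, 15_000, 20_000, 25_000])
predicted = np.array([ 9_000, 13_000, 14_000, 21_000, 26_000])

sse_model = np.sum((actual - predicted) ** 2)      # squared errors of the model
sse_mean  = np.sum((actual - actual.mean()) ** 2)  # squared errors of the "guess the mean" baseline

r2 = 1 - sse_model / sse_mean
print(round(r2, 3))  # ~0.966: the model explains roughly 97% of the variation
```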


2. RMSE & MAE: Error Magnitude

  • These do not compare to a baseline — they tell you how far off predictions are, on average.

  • RMSE penalizes large mistakes more heavily than MAE (because it squares the errors before averaging).

  • Both are absolute accuracy metrics, not relative to variance.


Example with Same Data

If predictions are:

Actual:    10k, 12k, 15k, 20k, 25k
Predicted: 9k, 13k, 14k, 21k, 26k

Errors: 1k, 1k, 1k, 1k, 1k

  • MAE = (1k + 1k + 1k + 1k + 1k) / 5 = 1k

  • RMSE = sqrt((1² + 1² + 1² + 1² + 1²) / 5) = 1k

  • R² = very high (≈ 0.97, as computed earlier), because the variance explained is high.
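A short sketch (NumPy; the second prediction vector with its 30k guess is illustrative, not from the example above) showing why RMSE and MAE agree when all errors are equal and diverge when one big miss appears:

```python
import numpy as np

actual = np.array([10_000, 12_000, 15_000, 20_000, 25_000])

def mae_rmse(predicted):
    err = actual - predicted
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

# Uniform 1k errors: MAE and RMSE are both 1,000
print(mae_rmse(np.array([9_000, 13_000, 14_000, 21_000, 26_000])))

# One large 5k miss: MAE rises modestly, RMSE jumps because the miss gets squared
print(mae_rmse(np.array([9_000, 13_000, 14_000, 21_000, 30_000])))
```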


Key Difference

  • R²: “How much of the pattern in the data did I capture?”

  • RMSE / MAE: “How far off am I, in the actual unit (e.g., $)?”

You can have:

  • High R² but high RMSE → You’re good at ranking & trend, but still making large dollar errors.

  • Low R² but low RMSE → Everyone gets about the same prediction, close to average, but model doesn’t capture much variation between people.



Regression evaluation techniques - credit limit assignment regression model

Scenario

You are building an XGBoost regression model that predicts “How much credit can be safely assigned to a loan requester” based on their application data, financial history, and behavioral data.
The target (y) is a continuous variable — the approved credit limit amount.


1. Accuracy Metrics & Intuition

Since this is regression, classification metrics like AUC or KS don’t apply.
Instead, we use error-based and rank-based metrics:

  • RMSE (Root Mean Squared Error): Penalizes large mistakes more heavily; good for spotting models that make occasional big blunders. Example: if RMSE = 3,000 and limits range 0–50k, large errors (like predicting 50k for someone eligible for 10k) happen occasionally.
  • MAE (Mean Absolute Error): Average size of error regardless of direction; easier to interpret than RMSE. Example: MAE = 2,000 means that, on average, credit limit predictions are off by $2,000.
  • MAPE (Mean Absolute Percentage Error): Normalizes error relative to the actual values. Example: if MAPE = 15%, predicting 8,500 for someone eligible for 10,000 is a 15% miss.
  • R² (Coefficient of Determination): Measures how much of the variance in actual credit limits the model explains. Example: R² = 0.65 means the model explains 65% of why different customers get different limits.
  • Spearman correlation: Focuses on rank ordering, which matters when the exact limit is less critical than ranking applicants from lowest to highest limit eligibility. Example: Spearman = 0.8 means applicants ranked high by the model generally do get higher limits.
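If you want all five numbers in one place, a sketch along these lines (scikit-learn and SciPy; the arrays are illustrative, not real applicant data) would do it:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative actual vs. predicted credit limits (dollars)
y_true = np.array([10_000, 12_000, 15_000, 20_000, 25_000])
y_pred = np.array([ 9_000, 13_000, 14_000, 21_000, 26_000])

rmse   = np.sqrt(mean_squared_error(y_true, y_pred))       # penalizes big misses
mae    = mean_absolute_error(y_true, y_pred)                # average dollar miss
mape   = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # miss as a % of the actual limit
r2     = r2_score(y_true, y_pred)                           # variance explained
rho, _ = spearmanr(y_true, y_pred)                          # rank-order agreement

print(f"RMSE={rmse:,.0f}  MAE={mae:,.0f}  MAPE={mape:.1f}%  R2={r2:.2f}  Spearman={rho:.2f}")
```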

2. Typical Acceptable Thresholds for Model Approval

(These are common industry validation ranges — vary by portfolio type)

  • RMSE: ≤ 15–25% of the credit limit range. Avoids big over-limit assignments that increase default risk.
  • MAE: ≤ 10–20% of the limit range. Keeps average prediction error within acceptable business tolerance.
  • MAPE: ≤ 30–40%. Ensures proportional accuracy across low and high limits.
  • R²: ≥ 0.5 (≥ 0.4 for noisy small-business data). Shows the model meaningfully explains applicant differences.
  • Spearman: ≥ 0.6–0.7. Keeps rank ordering stable, which is critical for approval tiers.

3. Why Rank Ordering Can Matter More Than Absolute Accuracy

  • In credit assignment, the exact dollar figure might be adjusted by policy rules after the model predicts.

  • What’s critical is ranking applicants correctly:

    • If the model thinks Applicant A should have a higher limit than Applicant B, that ordering should hold true most of the time.

    • This is why Spearman correlation is often a regulatory requirement alongside RMSE.


4. Stability & Predictive Power Checks

For regression, we can still use:

  • PSI (Population Stability Index): Ensures feature distributions don’t shift drastically between development and monitoring periods. Regression adjustment: use binned feature values, not target values (a minimal sketch follows this list).
  • CSI (Characteristic Stability Index): Checks whether the relationship between a feature and the target is stable. Regression adjustment: for a continuous target, bin both the feature and the target and compare the mean target per bin.
  • Feature importance: Identifies which variables drive credit limits most. Regression adjustment: use gain/cover importance in XGBoost.
  • SHAP: Explains predictions to regulators and the business. Works identically for regression.
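A minimal PSI sketch with NumPy (fixed-width bins for simplicity; decile bins built on the development sample are also common, and the 0.1 / 0.25 cut-offs in the comment are the usual rule of thumb rather than a hard standard):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between development (expected) and monitoring (actual) samples."""
    # Fixed-width bin edges spanning both samples
    edges = np.linspace(min(expected.min(), actual.min()),
                        max(expected.max(), actual.max()), bins + 1)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin shares so empty bins don't cause log(0) or division by zero
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Rule of thumb often quoted: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift
rng = np.random.default_rng(0)
dev_income = rng.normal(60_000, 15_000, 5_000)  # development-period feature values (simulated)
mon_income = rng.normal(63_000, 15_000, 5_000)  # monitoring-period feature values (simulated)
print(round(psi(dev_income, mon_income), 3))
```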

5. Model Approval Reality

  • If your RMSE, MAE, MAPE, and Spearman are within target ranges, and PSI/CSI show stability, you’re in a strong position.

  • Even if one metric is slightly weak, a model can be approved if:

    • It beats the current champion model.

    • It’s more stable over time.

    • It’s more explainable and policy-compliant.

  • Failing both accuracy and stability → high risk of rejection.



Evaluation techniques for regression

Let’s make this concrete with credit-engine-style examples so the intuition clicks.

Imagine we’re predicting Loss Given Default (LGD) for loans.
We have actual LGD values from historical defaults and model predictions.


1. RMSE – Root Mean Squared Error

Example:

  • Actual LGD: [0.10, 0.30, 0.90]

  • Predicted LGD: [0.12, 0.50, 0.30]

Errors: [0.02, 0.20, -0.60] → squared: [0.0004, 0.04, 0.36] → mean: 0.1335 → sqrt: 0.365

Interpretation:

  • The 0.60 miss on the last loan blows up the RMSE because squaring makes big mistakes shout louder.

  • RMSE here says: “Your typical big-mistake-weighted error is 36.5 percentage points.”


2. MAE – Mean Absolute Error

Same example:
Absolute errors: [0.02, 0.20, 0.60] → mean: 0.273

Interpretation:

  • MAE says: “On average, you’re off by 27.3 percentage points.”

  • It weights the 0.60 miss only in proportion to its size, rather than amplifying it the way squaring does.


3. MAPE – Mean Absolute Percentage Error

Example:
Absolute % errors:
[0.02/0.10 = 20%, 0.20/0.30 ≈ 66.7%, 0.60/0.90 ≈ 66.7%] → mean ≈ 51.1%

Interpretation:

  • “On average, you’re off by 51% of the actual LGD value.”

  • If an actual LGD is close to zero (e.g., 0.01) and you predict 0.10, MAPE goes crazy.
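The three hand calculations above can be reproduced in a few lines (NumPy; arrays copied from the example):

```python
import numpy as np

actual_lgd = np.array([0.10, 0.30, 0.90])
pred_lgd   = np.array([0.12, 0.50, 0.30])

err  = pred_lgd - actual_lgd
rmse = np.sqrt(np.mean(err ** 2))               # 0.365: the 0.60 miss dominates after squaring
mae  = np.mean(np.abs(err))                     # 0.273: every miss counts in proportion to its size
mape = np.mean(np.abs(err / actual_lgd)) * 100  # ~51.1%: explodes when actual LGD is near zero

print(round(rmse, 3), round(mae, 3), round(mape, 1))
```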


4. R² – Coefficient of Determination

Example:

  • If actual LGDs vary a lot, and your predictions capture that variation well, R² will be high.

  • If you just predict the average LGD for everyone, R² might be close to 0 — you didn’t explain any variance.

Interpretation:

  • R² answers: “How much of the LGD variability did I explain compared to just guessing the mean?”


5. Pearson vs. Spearman Correlation

Example:

  • Suppose actual LGD ranking (highest to lowest risk): Loan C, Loan B, Loan A.

  • Predictions: Loan C is still ranked highest, then A, then B.

Pearson: Could be low if the exact values are off (linear mismatch).
Spearman: Could still be high because the order is mostly right.

Interpretation:

  • Pearson: “Do the numbers line up in a straight-line way?”

  • Spearman: “Even if I got magnitudes wrong, did I keep the order right?”
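A small sketch with SciPy (the LGD numbers are illustrative, chosen so the predicted order is right while the magnitudes are badly compressed) showing how the two correlations can diverge:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Predicted order matches the actual order, but the relationship is monotone rather than linear
actual_lgd = np.array([0.01, 0.02, 0.05, 0.10, 0.90])
pred_lgd   = np.array([0.10, 0.20, 0.30, 0.40, 0.50])

pearson,  _ = pearsonr(actual_lgd, pred_lgd)
spearman, _ = spearmanr(actual_lgd, pred_lgd)

print(f"Pearson  = {pearson:.2f}")   # noticeably below 1: the straight-line fit is poor
print(f"Spearman = {spearman:.2f}")  # 1.00: the rank order is perfectly preserved
```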


Summary in credit engine terms:

  • RMSE → penalizes big LGD prediction errors heavily (good if big misses are expensive for the bank).

  • MAE → gives equal weight to all misses, good for stable reporting.

  • MAPE → interprets error in % terms, useful if LGD has consistent scale across products.

  • R² → tells you whether your model adds value beyond a dumb constant guess.

  • Spearman → good for prioritization tasks (e.g., which borrowers to monitor first).



Sunday, August 3, 2025

ROC-AUC - Step by step calculation

 Let’s go through ROC-AUC just like we did for KS — with intuitive explanation, formulas, and a step-by-step example using 10 observations.


📘 What is ROC-AUC?

🟦 ROC = Receiver Operating Characteristic Curve

It plots:

  • X-axis: False Positive Rate (FPR) = FP / (FP + TN)

  • Y-axis: True Positive Rate (TPR) = TP / (TP + FN)

Each point on the ROC curve represents a threshold on the predicted probability.


🟧 AUC = Area Under the Curve

  • AUC = Probability that the model ranks a random positive higher than a random negative

  • AUC ranges from:

    • 1.0 → perfect model

    • 0.5 → random guessing

    • < 0.5 → worse than random


✅ ROC-AUC Formula (Conceptually)

There are two main interpretations:

1. Integral of the ROC Curve:

AUC = ∫₀¹ TPR(FPR) dFPR

2. Rank-Based Interpretation (Used in practice):

AUC = (Number of correct positive-negative pairs) / (Total positive-negative pairs)

📊 Example: 10 Observations

We'll reuse the same 10 data points:

Obs | Actual (Y) | Predicted Score
1 | 1 | 0.95
2 | 0 | 0.90
3 | 1 | 0.85
4 | 0 | 0.80
5 | 0 | 0.70
6 | 1 | 0.60
7 | 0 | 0.40
8 | 0 | 0.30
9 | 1 | 0.20
10 | 0 | 0.10
  • Total Positives (P) = 4

  • Total Negatives (N) = 6


📈 Step-by-Step: Rank-Based AUC Calculation

Let’s find all (positive, negative) score pairs and count how many times:

  • Positive score > Negative score → Correct

  • Positive score == Negative score → 0.5 credit

  • Positive score < Negative score → Wrong

Step 1: List All Positive-Negative Pairs

Positive scores: 0.95, 0.85, 0.60, 0.20
Negative scores: 0.90, 0.80, 0.70, 0.40, 0.30, 0.10

Total Pairs = 4 × 6 = 24

Step 2: Count Favorable Pairs

Pos Score | Compared to Neg Scores | Wins
0.95 | > all (0.90 ... 0.10) | 6
0.85 | > all except 0.90 | 5
0.60 | > 0.40, 0.30, 0.10 | 3
0.20 | > 0.10 only | 1
Total |  | 6 + 5 + 3 + 1 = 15 wins

No ties, so:

AUC = 15 / 24 = 0.625

🧠 Interpretation:

  • The model has a 62.5% chance of ranking a random defaulter higher than a random non-defaulter.

  • Better than random, but not great.
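A sketch that reproduces the pairwise count and cross-checks it against scikit-learn's roc_auc_score (the data are the 10 observations above):

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score

y      = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])

pos, neg = scores[y == 1], scores[y == 0]

# Each (positive, negative) pair: 1 if the positive scores higher, 0.5 for a tie, 0 otherwise
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc_pairwise = wins / (len(pos) * len(neg))

print(auc_pairwise)              # 0.625  (15 wins out of 24 pairs)
print(roc_auc_score(y, scores))  # 0.625  (same answer from the library)
```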


📉 ROC Curve (Optional Idea):

If we plot TPR vs FPR at various thresholds:

  • Start at (0,0)

  • End at (1,1)

  • The area under that curve will match AUC = 0.625

KS Calculation - step by step

Let's walk through a step-by-step example of the KS statistic using 10 observations with:

  • Actuals (ground truth): 1 = defaulter, 0 = non-defaulter

  • Predicted scores: from a classification model


🧾 Sample Data: 10 Observations

Obs | Actual (Y) | Predicted Score
1 | 1 | 0.95
2 | 0 | 0.90
3 | 1 | 0.85
4 | 0 | 0.80
5 | 0 | 0.70
6 | 1 | 0.60
7 | 0 | 0.40
8 | 0 | 0.30
9 | 1 | 0.20
10 | 0 | 0.10

📊 Step 1: Sort by predicted score descending

Rank | Actual (Y) | Score | Cumulative Positives | Cumulative Negatives | (+ve %) - (-ve %)
1 | 1 | 0.95 | 1/4 = 0.25 | 0/6 = 0.000 | 0.250
2 | 0 | 0.90 | 1/4 = 0.25 | 1/6 = 0.167 | 0.083
3 | 1 | 0.85 | 2/4 = 0.50 | 1/6 = 0.167 | 0.333
4 | 0 | 0.80 | 2/4 = 0.50 | 2/6 = 0.333 | 0.167
5 | 0 | 0.70 | 2/4 = 0.50 | 3/6 = 0.500 | 0.000
6 | 1 | 0.60 | 3/4 = 0.75 | 3/6 = 0.500 | 0.250
7 | 0 | 0.40 | 3/4 = 0.75 | 4/6 = 0.667 | 0.083
8 | 0 | 0.30 | 3/4 = 0.75 | 5/6 = 0.833 | -0.083
9 | 1 | 0.20 | 4/4 = 1.00 | 5/6 = 0.833 | 0.167
10 | 0 | 0.10 | 4/4 = 1.00 | 6/6 = 1.000 | 0.000

✅ Step 2: Identify KS

Look for the maximum difference between:

  • (Cumulative positives) — % of defaulters seen so far

  • (Cumulative negatives) — % of non-defaulters seen so far

The maximum value in the last column (cumulative positives % minus cumulative negatives %) is:

KS = 0.333, at Rank 3 (score = 0.85)

🔍 Interpretation:

  • KS = 0.333 → The maximum separation between defaulters and non-defaulters occurs when the score threshold is around 0.85

  • At that point:

    • You've captured 50% of defaulters

    • Only 16.7% of non-defaulters

  • This is the optimal score threshold for maximum model discrimination
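The table above can also be generated programmatically; a sketch with NumPy (data copied from Step 1):

```python
import numpy as np

y      = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])

order    = np.argsort(-scores)  # sort by predicted score, descending
y_sorted = y[order]

cum_pos = np.cumsum(y_sorted) / y_sorted.sum()            # % of defaulters captured so far
cum_neg = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()  # % of non-defaulters captured so far

diff = cum_pos - cum_neg
print(round(diff.max(), 3), "at score", scores[order][diff.argmax()])  # 0.333 at score 0.85
```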

KS Statistic

 The KS (Kolmogorov-Smirnov) Statistic is a powerful and commonly used evaluation metric for binary classification models, especially in finance, credit scoring, and risk modeling.


📊 What is KS Statistic?

The KS statistic measures the maximum difference between the cumulative distribution functions (CDFs) of the predicted scores for the positive class (events) and negative class (non-events).

Formula:

KS = max_x |F₁(x) − F₀(x)|

Where:

  • F₁(x): Cumulative distribution of positive class scores (e.g., default)

  • F₀(x): Cumulative distribution of negative class scores (e.g., non-default)


🧠 Intuition:

  • It tells how well the model separates the two classes.

  • A higher KS value means better separation of good and bad cases.

  • KS = 0: no separation (useless model)

  • KS = 1: perfect separation (ideal but unrealistic)
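In practice the same number can come straight from SciPy's two-sample KS test applied to the score distributions of the two classes; a sketch using the 10 observations from the step-by-step post above:

```python
import numpy as np
from scipy.stats import ks_2samp

y      = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])

# KS statistic = maximum gap between the score CDFs of defaulters and non-defaulters
result = ks_2samp(scores[y == 1], scores[y == 0])
print(round(result.statistic, 3))  # 0.333, matching the step-by-step table
```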


📌 Usage by Domain

Domain | Why KS is Used
Banking / Credit Risk | Industry standard for measuring discriminatory power between defaulters and non-defaulters
Insurance | Distinguishing claimants vs non-claimants
Fraud Detection | Separating fraudulent from legitimate transactions
Marketing | Used less commonly; better-suited metrics include precision@k and lift

✅ Typical KS Value Interpretation:

KS Score | Model Quality
< 0.2 | Poor
0.2 - 0.3 | Fair
0.3 - 0.4 | Good
> 0.4 | Excellent