Monday, August 11, 2025

Covariance

Significance of covariance & meaning of high vs low values

Covariance measures how two features vary together:

Positive covariance → When feature $F_1$ is above its mean, feature $F_2$ tends to also be above its mean. (They move in the same direction.)
Negative covariance → When $F_1$ is above its mean, $F_2$ tends to be below its mean. (They move in opposite directions.)
Near zero covariance → No consistent linear relationship: knowing one feature’s deviation from the mean tells you little about the direction of the other’s. (A nonlinear relationship can still exist.)

Numerically:

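The sign is directly interpretable; the magnitude is not, because covariance scales with the units of both features (measuring income in dollars instead of thousands multiplies it by 1,000).
To judge strength, divide by both standard deviations to get the correlation: values near ±1 mean a strong linear relationship, values near 0 a weak or absent one.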
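
A quick numerical check of the scale issue, as a minimal sketch on made-up data (the variable names and the income/limit relationship are assumptions for illustration only). It shows the covariance changing by a factor of 1,000 when the unit changes, while the correlation stays put:

```python
import numpy as np

# Hypothetical toy data: income in dollars and a noisy linear function of it.
rng = np.random.default_rng(0)
income_usd = rng.normal(60_000, 15_000, size=1_000)
limit_usd = 0.2 * income_usd + rng.normal(0, 1_000, size=1_000)

def cov_and_corr(x, y):
    """Return (sample covariance, Pearson correlation) of two 1-D arrays."""
    cov = np.cov(x, y)[0, 1]
    corr = np.corrcoef(x, y)[0, 1]
    return cov, corr

cov_usd, corr_usd = cov_and_corr(income_usd, limit_usd)
# Same relationship, but income expressed in thousands of dollars.
cov_k, corr_k = cov_and_corr(income_usd / 1_000, limit_usd)

print(f"covariance with income in USD:  {cov_usd:,.0f}")
print(f"covariance with income in kUSD: {cov_k:,.0f}")
print(f"correlation: {corr_usd:.3f} vs {corr_k:.3f}")
```

The relationship is identical in both cases; only the unit of one feature changed.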

Why covariance matters in PCA

The covariance matrix encodes all pairwise relationships between features.
If features are highly correlated (covariance that is large relative to their variances, positive or negative), PCA will combine them into a principal component that captures their shared variation, so you don’t have redundancy.
If covariances are near zero, features are linearly uncorrelated (which is weaker than independent); PCA will mostly keep them separate unless variances are drastically different.

💡 Analogy
Think of the covariance matrix as a “map” of how all features move together.
The eigenvectors are “routes” through this map that maximize variance.
PCA rotates your view to look along those routes, and you project the original data (not the covariance matrix itself) into that rotated view.
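
The analogy can be sketched in a few lines of NumPy. This is an illustration on synthetic data (the three features are assumptions), using a plain eigendecomposition rather than any particular PCA library:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: two strongly related features plus one unrelated feature.
f1 = rng.normal(size=500)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=500)
f3 = rng.normal(size=500)
X = np.column_stack([f1, f2, f3])

# 1. Centre the data and build the covariance matrix (the "map").
X_centered = X - X.mean(axis=0)
cov_matrix = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors are the "routes"; eigenvalues are the variance along each.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]  # sort by variance, descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 3. Project the original (centred) data, not the covariance matrix itself.
scores = X_centered @ eigenvectors

explained_ratio = eigenvalues / eigenvalues.sum()
print("explained variance ratio:", np.round(explained_ratio, 3))
```

After the rotation the projected columns are uncorrelated, and the variance of each column equals its eigenvalue.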

Sunday, August 10, 2025

How R2 is different from RMSE, MAE

Let’s unpack this step-by-step — because R², RMSE, and MAE all involve “difference between actual and predicted,” but they measure different things and answer different questions.

1. R² (Coefficient of Determination) — Variance Captured

Think of your target values (y_actual) as having some spread (variance) around their mean.
If you didn’t have a model and just guessed the mean for everyone, that’s your baseline.
R² asks:

"How much better is my model compared to just guessing the mean every time?"

Formula

$R^2 = 1 - \frac{\text{SSE}_{\text{model}}}{\text{SSE}_{\text{mean}}}$

Where:

SSE_model = Σ(Actual − Predicted)²
SSE_mean = Σ(Actual − Mean)²

Intuition

R² = 1.0 → Model perfectly predicts all values (100% of variance explained).
R² = 0.0 → Model is no better than guessing the mean.
R² < 0.0 → Model is worse than guessing the mean (ouch).

Example in Credit Limit Prediction

Let’s say actual limits for 5 customers are:

Actual:  10k, 12k, 15k, 20k, 25k
Mean:    16.4k

Variance is the spread around 16.4k.

Case A: Terrible Model

Predicted: 16.4k for everyone (mean model) →
SSE_model = SSE_mean → R² = 0.

Case B: Decent Model

Predicted: 9k, 13k, 14k, 21k, 26k →
SSE_model = 5 (in k²) is much smaller than SSE_mean = 149.2 → R² = 1 − 5/149.2 ≈ 0.97.
This means the model explains about 97% of the variation in limits between customers.

2. RMSE & MAE — Error Magnitude

These do not compare to a baseline — they tell you how far off predictions are, on average.
RMSE penalizes large mistakes more heavily than MAE (because it squares the errors before averaging).
Both are absolute accuracy metrics, not relative to variance.

Example with Same Data

If predictions are:

Actual:    10k, 12k, 15k, 20k, 25k
Predicted: 9k, 13k, 14k, 21k, 26k

Absolute errors: 1k, 1k, 1k, 1k, 1k

MAE = (1k + 1k + 1k + 1k + 1k) / 5 = 1k
RMSE = sqrt((1² + 1² + 1² + 1² + 1²) / 5) = 1k
R² ≈ 0.97, because the 1k errors are small relative to the spread of the limits.
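
The three numbers for this five-customer example can be checked directly. A minimal sketch in plain NumPy, with values in $k:

```python
import numpy as np

actual = np.array([10, 12, 15, 20, 25], dtype=float)    # in $k
predicted = np.array([9, 13, 14, 21, 26], dtype=float)  # in $k

errors = predicted - actual
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

# R² compares the model's squared error with that of the mean-only baseline.
sse_model = np.sum((actual - predicted) ** 2)
sse_mean = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - sse_model / sse_mean

print(f"MAE  = {mae:.2f}k")
print(f"RMSE = {rmse:.2f}k")
print(f"R2   = {r2:.3f}")
```

MAE and RMSE coincide here only because every error has the same size; as soon as one error is larger than the rest, RMSE pulls ahead of MAE.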

Key Difference

R²: “How much of the pattern in the data did I capture?”
RMSE / MAE: “How far off am I, in the actual unit (e.g., $)?”

You can have:

High R² but high RMSE → You’re good at ranking & trend, but still making large dollar errors.
Low R² but low RMSE → Everyone gets about the same prediction, close to average, but model doesn’t capture much variation between people.
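
Both situations are easy to reproduce. The sketch below uses two invented portfolios (all numbers are assumptions chosen to make the contrast visible): one with widely spread limits, one where everybody sits near $10k.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    """Coefficient of determination relative to the mean-only baseline."""
    return float(1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

# Wide spread of limits: every prediction misses by $5k, yet R² stays high.
y_wide = np.array([10_000, 50_000, 100_000, 200_000], dtype=float)
pred_wide = y_wide + np.array([5_000, -5_000, 5_000, -5_000])

# Narrow spread: predicting the mean for everyone gives tiny dollar errors
# but explains none of the (small) variation between customers.
y_narrow = np.array([9_900, 10_000, 10_100, 10_000], dtype=float)
pred_narrow = np.full_like(y_narrow, 10_000)

print(f"wide:   RMSE = {rmse(y_wide, pred_wide):,.0f}  R2 = {r2(y_wide, pred_wide):.3f}")
print(f"narrow: RMSE = {rmse(y_narrow, pred_narrow):,.0f}  R2 = {r2(y_narrow, pred_narrow):.3f}")
```

The takeaway is that R² is always relative to how much the target varies, while RMSE is in dollars regardless of that spread.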

📘 Regression Model Evaluation: Credit Limit Assignment

🎯 The Scenario

You're building an XGBoost regression model that predicts:
"How much credit can we safely give this loan applicant?"
Input: Application data, financial history, behavior data
Output (y): A dollar amount (e.g., $10,000), which is continuous, not a yes/no
Since we're predicting a number (not a class), we can't use AUC, KS, or accuracy. We need regression-specific metrics.

1️⃣ The Core Metrics: How to Measure "How Wrong" the Model Is

Think of these as different ways to ask: "How far off were our predictions?"

📊 The Five Key Metrics (Simplified)

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Predicted: $8,000   Actual: $10,000   Error: -$2,000        │
│                                                              │
│  Different metrics measure this gap differently:             │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Metric	What It Tells You	Easy Example
RMSE	Punishes BIG mistakes more harshly	RMSE = $3,000 → occasional huge errors (like predicting $50K when answer is $10K)
MAE	Average error in dollars	MAE = $2,000 → typical prediction is $2K off
MAPE	Average error as a %	MAPE = 15% → predictions are typically 15% off
R²	% of variation the model explains	R² = 0.65 → model explains 65% of why limits differ
Spearman	How well rankings match (not exact $)	0.8 → applicants ranked high by model usually do get high limits

🔍 Quick Visual: RMSE vs MAE

Actual limits:    $10K, $10K, $10K, $10K
Predictions:      $9K,  $9K,  $9K,  $50K   ← One big mistake!

MAE  = average of |errors|       = $11K (smoothed)
RMSE = sqrt(average of errors²)  = $20K (BIG mistake stands out!)

→ Use RMSE when big errors are costly (like over-lending)
→ Use MAE when you want a fair average
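
The same four-applicant example in code, as a minimal check of the rounded figures above (values in $K):

```python
import numpy as np

actual = np.array([10, 10, 10, 10], dtype=float)    # $K
predicted = np.array([9, 9, 9, 50], dtype=float)    # $K, one big miss

abs_errors = np.abs(predicted - actual)             # [1, 1, 1, 40]
mae = abs_errors.mean()
rmse = np.sqrt(np.mean(abs_errors ** 2))

print(f"MAE  = {mae:.2f}K")
print(f"RMSE = {rmse:.2f}K")
```

The single $40K miss contributes 40 to the MAE sum but 1,600 to the RMSE sum, which is why RMSE ends up nearly twice the MAE.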

💡 Why Each Metric Matters in Credit Context

┌────────────────────────────────────────────────────────┐
│                                                        │
│  RMSE → "Did we make any DISASTROUS predictions?"      │
│         (Giving $50K to someone who can repay $10K)    │
│                                                        │
│  MAE  → "What's our TYPICAL miss in dollars?"          │
│         (Off by ~$2K on average)                       │
│                                                        │
│  MAPE → "How proportional is our error?"               │
│         ($1K miss on a $5K limit = bigger problem      │
│          than $1K miss on a $50K limit)                │
│                                                        │
│  R²   → "Is our model better than just guessing        │
│          the average?"                                 │
│                                                        │
│  Spearman → "Do we at least rank people correctly?"    │
│             (Even if dollars are slightly off)         │
│                                                        │
└────────────────────────────────────────────────────────┘

2️⃣ What's "Good Enough"? Industry Thresholds

These are typical benchmarks (varies by business, but useful guidelines):
Metric	Acceptable Range	Why It Matters
RMSE	≤ 15–25% of credit range	Prevents catastrophic over-lending
MAE	≤ 10–20% of credit range	Keeps typical errors within tolerance
MAPE	≤ 30–40%	Ensures fairness across small & large limits
R²	≥ 0.5 (≥ 0.4 for noisy data)	Model meaningfully explains differences
Spearman	≥ 0.6–0.7	Critical for ranking applicants into tiers

📌 Example Sanity Check

Credit limits range from $0 to $50,000

✅ RMSE = $3,000   →  6% of range  → GOOD
✅ MAE  = $2,000   →  4% of range  → GREAT
⚠️ MAPE = 35%      →  Borderline   → OK
✅ R²   = 0.65     →  65% explained → GOOD  
✅ Spearman = 0.75 →  Rankings solid → GOOD

Model status: APPROVED ✅

3️⃣ Why Rank Ordering Often Matters MORE Than Exact $$

Here's a crucial insight that surprises many people:
In credit risk, the EXACT predicted dollar amount usually gets adjusted by business rules anyway. What matters most is the RANKING — does the model correctly order applicants from low-risk to high-risk?

🎯 Visual Example

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Applicants and Model Predictions:                       │
│                                                          │
│  Person A: Model predicts $8K, gets approved for $7K    │
│  Person B: Model predicts $15K, gets approved for $14K  │
│  Person C: Model predicts $25K, gets approved for $20K  │
│                                                          │
│  ✅ Exact $ may differ from approvals                    │
│  ✅ BUT the ORDER (A < B < C) is correct                 │
│  ✅ Business policy makes final $ adjustments            │
│                                                          │
│  → Spearman rank correlation captures this perfectly     │
│                                                          │
└──────────────────────────────────────────────────────────┘

Why This Matters for Regulators

Approval tiers (Tier 1, 2, 3) depend on relative ranking
Evidence of rank ordering, such as Spearman correlation, is commonly expected in model validation
A model with slightly worse RMSE but better Spearman is often preferred ✅
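
For the three applicants in the visual example, the rank agreement can be computed with `scipy.stats.spearmanr`. A minimal sketch; the list names are illustrative:

```python
from scipy.stats import spearmanr

# Persons A, B, C from the example above.
model_prediction = [8_000, 15_000, 25_000]
approved_limit = [7_000, 14_000, 20_000]

rho, _ = spearmanr(model_prediction, approved_limit)
print(f"Spearman rho = {rho:.2f}")
```

Every dollar amount differs between the two lists, but the order A < B < C is identical, so the rank correlation is perfect.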

4️⃣ Stability Checks: Does the Model Hold Up Over Time?

Accuracy on test data isn't enough. You need to know:
"Will this model still work in 6 months when customer behavior shifts?"

Two Critical Stability Tests

🔄 PSI (Population Stability Index)

What it checks: Are your input features behaving similarly now vs when you trained?

Training time:     Today (6 months later):
                                          
Income distribution:    Income distribution:
$30K-$50K: 40%         $30K-$50K: 25%  ← SHIFTED!
$50K-$80K: 40%         $50K-$80K: 35%
$80K+:     20%         $80K+:     40%

PSI ≈ 0.22 → sizeable shift, just under the 0.25 retrain threshold
Population has changed → monitor closely; model may degrade

PSI Score	Interpretation
< 0.10	✅ Stable — no action needed
0.10 – 0.25	⚠️ Slight shift — monitor
> 0.25	🚨 Major shift — retrain model

For regression: Apply PSI on binned features, not on the target value.
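
A minimal PSI sketch for the income example above. The function name `psi` and the `eps` guard against empty bins are assumptions of this sketch, not a standard API:

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions (proportions)."""
    expected = np.clip(np.asarray(expected, dtype=float), eps, None)
    actual = np.clip(np.asarray(actual, dtype=float), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Bins: $30K-$50K, $50K-$80K, $80K+
train_share = [0.40, 0.40, 0.20]
today_share = [0.25, 0.35, 0.40]

income_psi = psi(train_share, today_share)
print(f"PSI = {income_psi:.3f}")
```

For these shares the result is about 0.22, which falls in the 0.10–0.25 band of the table above. Most of it comes from the $80K+ bin, whose share doubled.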

🎯 CSI (Characteristic Stability Index)

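What it checks: Is the relationship between a feature and the target stable? (Many scorecard references use CSI more narrowly, for PSI computed on a single feature; these notes use the target-aware meaning.)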

Example: "Does income still predict credit limit the same way?"

Training: Income $50K → avg limit $15K
Today:    Income $50K → avg limit $12K  ← Relationship shifted!

For regression:

Bin the feature into groups
Calculate mean target per bin (training vs production)
Compare distributions
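
The three steps above could look like this in pandas. This is a sketch on synthetic data: the column names, bin edges, and the drop in the income-to-limit slope from 0.30 to 0.24 are all assumptions made for illustration, and `mean_target_by_bin` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def mean_target_by_bin(df, feature, target, bin_edges):
    """Mean of `target` within each bin of `feature`."""
    bins = pd.cut(df[feature], bins=bin_edges, include_lowest=True)
    return df.groupby(bins, observed=False)[target].mean()

# Hypothetical data: the income-to-limit slope drops in production.
rng = np.random.default_rng(1)
train = pd.DataFrame({"income": rng.uniform(30_000, 100_000, 2_000)})
train["limit"] = 0.30 * train["income"] + rng.normal(0, 500, len(train))
prod = pd.DataFrame({"income": rng.uniform(30_000, 100_000, 2_000)})
prod["limit"] = 0.24 * prod["income"] + rng.normal(0, 500, len(prod))

edges = [30_000, 50_000, 80_000, 100_000]
comparison = pd.DataFrame({
    "train_mean": mean_target_by_bin(train, "income", "limit", edges),
    "prod_mean": mean_target_by_bin(prod, "income", "limit", edges),
})
comparison["relative_change"] = comparison["prod_mean"] / comparison["train_mean"] - 1
print(comparison.round(2))
```

A consistent relative change across bins (here roughly −20%) points to a shifted feature-target relationship rather than a shifted population.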

🛠️ Plus the Standard Diagnostics

Tool	Purpose
Feature Importance (XGBoost gain/cover)	Which features drive predictions?
SHAP values	Explain individual predictions to regulators

5️⃣ The Final Approval Decision

✅ Strong Approval Case

┌─────────────────────────────────────────────┐
│  ✅ RMSE/MAE/MAPE within targets             │
│  ✅ R² ≥ 0.5                                 │
│  ✅ Spearman ≥ 0.65                          │
│  ✅ PSI < 0.10 (stable population)           │
│  ✅ CSI < 0.10 (stable relationships)        │
│                                             │
│  → APPROVED with confidence ✅               │
└─────────────────────────────────────────────┘

⚠️ Borderline Cases (Still Approvable)

A model can be approved even if one metric is weak, IF:
✅ It beats the current champion model
✅ It's more stable over time
✅ It's more explainable for regulators
✅ It's policy-compliant

🚫 Likely Rejection

❌ Multiple accuracy metrics fail
❌ AND stability checks fail
→ High risk → REJECT or rebuild

🎯 The "Remember Forever" Cheat Sheet

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  📌 ACCURACY METRICS (How wrong are predictions?)        │
│     • RMSE → punishes big errors                         │
│     • MAE  → average error in $                          │
│     • MAPE → average error in %                          │
│     • R²   → % of variance explained                     │
│     • Spearman → ranking accuracy                        │
│                                                          │
│  📌 STABILITY METRICS (Will model still work later?)     │
│     • PSI → are features stable over time?               │
│     • CSI → are feature-target relationships stable?     │
│                                                          │
│  📌 EXPLAINABILITY (Can we justify to regulators?)       │
│     • Feature Importance → which features matter?        │
│     • SHAP → why this specific prediction?               │
│                                                          │
│  📌 KEY INSIGHT                                          │
│     Ranking > Exact $ Value                              │
│     (Policy rules adjust amounts anyway)                 │
│                                                          │
└──────────────────────────────────────────────────────────┘

💻 Quick Code Reference

regression_evaluation.py

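import numpy as np
from scipy.stats import spearmanr  # assumed: needed for the Spearman figure below
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)
# (the rest of the original listing is not shown)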

Sample Output:

RMSE:     $3,000
MAE:      $2,000
MAPE:     15%
R²:       0.65
Spearman: 0.75
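
The listing above is cut off after its imports. As a hedged sketch of what such a script might compute (the function name `evaluate_regression` and the input values are assumptions, not the original code), the five metrics can be gathered like this:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

def evaluate_regression(y_true, y_pred):
    """Return RMSE, MAE, MAPE, R2 and Spearman rho as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rho, _ = spearmanr(y_true, y_pred)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        # scikit-learn returns MAPE as a fraction (0.15), not a percentage (15).
        "MAPE": float(mean_absolute_percentage_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
        "Spearman": float(rho),
    }

# Hypothetical limits in dollars, reusing the five-customer example.
metrics = evaluate_regression(
    [10_000, 12_000, 15_000, 20_000, 25_000],
    [9_000, 13_000, 14_000, 21_000, 26_000],
)
for name, value in metrics.items():
    print(f"{name:9s} {value:,.4f}")
```

These inputs will not reproduce the sample output above, which came from a different (unshown) dataset.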

Evaluation techniques for regression

Let’s make this concrete with credit engine–style examples so the intuition clicks.

Imagine we’re predicting Loss Given Default (LGD) for loans.
We have actual LGD values from historical defaults and model predictions.

1. RMSE – Root Mean Squared Error

Example:

Actual LGD: [0.10, 0.30, 0.90]
Predicted LGD: [0.12, 0.50, 0.30]

Errors: [0.02, 0.20, -0.60] → squared: [0.0004, 0.04, 0.36] → mean: 0.1335 → sqrt: 0.365

Interpretation:

The 0.60 miss on the last loan blows up the RMSE because squaring makes big mistakes shout louder.
RMSE here says: “Your typical big-mistake-weighted error is 36.5 percentage points.”

2. MAE – Mean Absolute Error

Same example:
Absolute errors: [0.02, 0.20, 0.60] → mean: 0.273

Interpretation:

MAE says: “On average, you’re off by 27.3 percentage points.”
Each miss counts in proportion to its size: the 0.60 miss contributes 30 times as much as the 0.02 miss, with no extra penalty for being large.

3. MAPE – Mean Absolute Percentage Error

Example:
Absolute % errors:
[0.02/0.10 = 20%, 0.20/0.30 ≈ 66.7%, 0.60/0.90 ≈ 66.7%] → mean ≈ 51.1%

Interpretation:

“On average, you’re off by 51% of the actual LGD value.”
If an actual LGD is close to zero (e.g., 0.01) and you predict 0.10, MAPE goes crazy.
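
The three error metrics for the LGD example, plus the near-zero case, in one minimal NumPy sketch:

```python
import numpy as np

actual = np.array([0.10, 0.30, 0.90])
predicted = np.array([0.12, 0.50, 0.30])

err = predicted - actual
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
mape = np.mean(np.abs(err) / actual)

print(f"RMSE = {rmse:.3f}")
print(f"MAE  = {mae:.3f}")
print(f"MAPE = {mape:.1%}")

# One loan with actual LGD 0.01 and prediction 0.10:
# a 9-point absolute miss becomes a 900% relative miss.
near_zero_ape = abs(0.10 - 0.01) / 0.01
print(f"APE for the near-zero loan = {near_zero_ape:.0%}")
```

A single loan like that would dominate the average, which is why MAPE is usually avoided when actual values can sit close to zero.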

4. R² – Coefficient of Determination

Example:

If actual LGDs vary a lot, and your predictions capture that variation well, R² will be high.
If you just predict the average LGD for everyone, R² is 0 (exactly 0 when that average is the mean of the same data): you didn’t explain any variance.

Interpretation:

R² answers: “How much of the LGD variability did I explain compared to just guessing the mean?”

5. Pearson vs. Spearman Correlation

Example:

Suppose actual LGD ranking (highest to lowest risk): Loan C, Loan B, Loan A.
Predictions: Loan C is still ranked highest, then A, then B.

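Pearson: Could be low if predictions are not linearly related to the actual values, even when the order is preserved.
Spearman: Depends only on the order. Here one adjacent swap among three loans gives ρ = 0.5; with more loans, a few local swaps leave it close to 1.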

Interpretation:

Pearson: “Do the numbers line up in a straight-line way?”
Spearman: “Even if I got magnitudes wrong, did I keep the order right?”
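
A small illustration of the difference, using made-up LGD values. The predictions are the actual values cubed, an assumption chosen only because cubing keeps the order while distorting the magnitudes:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

actual = np.array([0.05, 0.10, 0.20, 0.40, 0.80])
predicted = actual ** 3  # same order, badly distorted magnitudes

pearson_r, _ = pearsonr(actual, predicted)
spearman_rho, _ = spearmanr(actual, predicted)
print(f"Pearson  = {pearson_r:.3f}")
print(f"Spearman = {spearman_rho:.3f}")

# The three-loan ranking from the example: actual order C, B, A (ranks 3, 2, 1)
# versus predicted order C, A, B, so loans A and B swap places.
rho_three_loans, _ = spearmanr([3, 2, 1], [3, 1, 2])
print(f"Spearman for the three-loan example = {rho_three_loans:.2f}")
```

Spearman is exactly 1 for the cubed predictions because no pair changes order, while Pearson falls below 1 because the relationship is not a straight line.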

Summary in credit engine terms:

RMSE → penalizes big LGD prediction errors heavily (good if big misses are expensive for the bank).
MAE → gives equal weight to all misses, good for stable reporting.
MAPE → interprets error in % terms, useful if LGD has consistent scale across products.
R² → tells if your model adds value beyond a dumb constant guess.
Spearman → good for prioritization tasks (e.g., which borrowers to monitor first).

IV, PSI, CSI - differences

Let’s frame this in a churn prediction context. It is a common case where IV, PSI, and CSI are all used, and where the similar-looking formulas make it easy to be confused about why they are treated differently.

1️⃣ The setting — churn prediction

Target: churn_flag (1 = churned, 0 = stayed).
Feature: avg_monthly_usage (average minutes per month).
Goal: Build a model that predicts churn, and also monitor if the feature is stable over time.

We have:

Train set → Customers from Jan–Mar 2025
OOT1 → Customers from Apr 2025
OOT2 → Customers from May 2025

2️⃣ The same base formula — different contexts

The mathematical core of IV, PSI, and CSI is a weighted log ratio:

$\text{metric} = \sum_{\text{bins}} (\text{fraction}_1 - \text{fraction}_2) \times \ln\left(\frac{\text{fraction}_1}{\text{fraction}_2}\right)$

The difference is what those “fractions” mean and which datasets are compared.

3️⃣ Information Value (IV)

Question: Does this feature separate churners from non-churners in a single dataset?
Fractions:
- $p_{\text{stay, bin}}$ = fraction of stayers in that bin (within train set)
- $p_{\text{churn, bin}}$ = fraction of churners in that bin (within train set)
Data involved: Only one dataset (e.g., Train).
Use: Feature selection. A common rule of thumb drops features with IV below about 0.02 and treats 0.1–0.3 as medium and above 0.3 as strong predictive power.

Example:

Train:
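Low usage: 80% of customers in the bin churned, 20% stayed
High usage: 10% churned, 90% stayed

These are churn rates within each bin. IV itself uses the share of all churners and the share of all stayers that fall in each bin, but rates this different across bins imply very different shares, so the result is a high IV → strong predictive power.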

4️⃣ Population Stability Index (PSI)

Question: Has the overall feature distribution shifted over time? (no target involved)
Fractions:
- $p_{\text{bin, train}}$ = proportion of customers in that bin in Train (all customers, churned or not)
- $p_{\text{bin, OOT}}$ = proportion of customers in that bin in OOT (all customers, churned or not)
Data involved: Two datasets (e.g., Train vs OOT1).
Use: Detect population drift — if customers’ usage patterns shift, even if churn rate doesn’t change.

Example:

Train:
Low usage: 30% of all customers
High usage: 70%

OOT1:
Low usage: 50% of all customers
High usage: 50%

PSI ≈ 0.17 → a moderate shift by the usual thresholds (0.10–0.25): the customer base now has more low-usage customers.

5️⃣ Characteristic Stability Index (CSI)

Question: Has the relationship between the feature and the target changed over time? (concept drift)
Fractions:
- $\text{event\_frac}_{A, \text{bin}}$ = proportion of churners in Train that fall into that bin
- $\text{event\_frac}_{B, \text{bin}}$ = proportion of churners in OOT that fall into that bin
Data involved: Two datasets (Train vs OOT1), target-specific.
Use: Detect changes in target–feature relationship.

Example:

Train churners:
Low usage: 70% of churners
High usage: 30% of churners

OOT1 churners:
Low usage: 50% of churners
High usage: 50%

CSI ≈ 0.17 → churn pattern shifted noticeably; low usage no longer dominates churn.

6️⃣ Why they differ even if formula looks same

The formula structure is the same because all three are distribution comparison measures: each has the form of a symmetrised KL divergence, Σ (p − q) · ln(p / q).
But the inputs differ:

IV → compares the two target classes (stayers vs churners; good vs bad in credit scoring) within one dataset.
PSI → compares overall feature distribution across datasets.
CSI → compares event-specific feature distribution across datasets.
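
To make the "same formula, different inputs" point concrete, here is one function applied three times to the churn examples above. The helper name `weighted_log_ratio` is hypothetical, and the IV inputs assume 100 customers in each usage bin, which the notes do not state:

```python
import numpy as np

def weighted_log_ratio(p, q):
    """Sum over bins of (p - q) * ln(p / q) for two distributions over the same bins."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))

# Bins are [low usage, high usage] throughout.

# IV: share of stayers vs share of churners per bin, within Train.
# Assumption: 100 customers per bin with 80% / 10% churn rates,
# i.e. 80 + 10 churners and 20 + 90 stayers.
churner_share = np.array([80, 10]) / 90
stayer_share = np.array([20, 90]) / 110
iv = weighted_log_ratio(stayer_share, churner_share)

# PSI: share of all customers per bin, Train vs OOT1.
psi = weighted_log_ratio([0.30, 0.70], [0.50, 0.50])

# CSI (as defined in these notes): share of churners per bin, Train vs OOT1.
csi = weighted_log_ratio([0.70, 0.30], [0.50, 0.50])

print(f"IV  = {iv:.3f}")
print(f"PSI = {psi:.3f}")
print(f"CSI = {csi:.3f}")
```

PSI and CSI come out identical here only because the two examples happen to use mirror-image shares; the inputs measure different things.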

That’s why in churn:

A feature can have high IV, low PSI, low CSI → predictive and stable.
Or high IV, high PSI → predictive, but customer profile is shifting (risk for model drift).
Or high IV, high CSI → predictive in train, but churn relationship is changing (concept drift).

Saturday, August 9, 2025

Variance Inflation Factor (VIF)

Understanding Variance Inflation Factor (VIF) — An Intuitive Guide

What is VIF?

The Variance Inflation Factor (VIF) measures how strongly a predictor variable is linearly explained by the other predictors in your dataset, and therefore how much the variance of its estimated regression coefficient is inflated. It’s a key tool for detecting multicollinearity, a condition where predictors are highly correlated, potentially causing instability in regression models.

Why Multicollinearity Matters

When predictors overlap in the information they provide:

The model struggles to determine which feature is truly influencing the target.
Coefficient estimates can become unstable and unreliable.
Interpretability suffers, making it harder to trust the model.

How VIF is Calculated (Intuitively)

Choose a predictor variable (e.g., X₁).
Regress X₁ against all the other predictors (X₂, X₃, …, Xₙ).
- Essentially: “Can the other features predict X₁?”
Calculate R² — the proportion of variance in X₁ explained by the others.
Apply the formula:
```
VIF = 1 / (1 - R²)
```
- Low R² → Denominator close to 1 → VIF ≈ 1 (low correlation).
- High R² → Denominator small → VIF large (high correlation).
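
The steps above can be sketched by hand with NumPy least squares. The function name `vif` and the synthetic columns are assumptions for illustration; in practice a library routine would normally be used instead:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=1_000)
x2 = x1 + 0.1 * rng.normal(size=1_000)
x3 = rng.normal(size=1_000)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

The two near-duplicate columns get very large VIFs, while the unrelated column stays close to 1.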

How to Interpret VIF Values

VIF Value	Meaning
1	No correlation with other features (ideal)
< 5	Acceptable
5–10	Moderate to high correlation — monitor closely
> 10	Severe multicollinearity — problematic

An Intuitive Example

Imagine two features: height and leg length.
If leg length is almost always a fixed fraction of height:

Regressing leg length on height would yield a very high R².
The VIF for leg length would be large, signaling redundancy.

Purpose of VIF in Modeling

Identifies redundant predictors.
Helps decide whether to drop or combine correlated features.
Improves model stability and interpretability.

Key Takeaway

Question VIF answers: “Can I predict this feature using the others?”
High VIF: Strong multicollinearity → unstable estimates.
Low VIF: Predictors are relatively independent → more stable, more interpretable coefficient estimates.

Understanding R² and VIF — From Model Fit to Multicollinearity

When building regression models, two concepts often come up together: R² (R-squared) and VIF (Variance Inflation Factor).
One measures how well your model fits the data, while the other checks for redundancy between predictors.
Let’s break them down intuitively and see how they connect.

1. What is R²?

R² — also known as the coefficient of determination — tells you how well your model’s predictions match the actual data.

R² = 1 → Perfect fit (model predictions match data exactly)
R² = 0 → Model explains none of the variation (as good as predicting the mean)
R² < 0 → Worse than just predicting the mean

What R² Really Measures

It represents the proportion of variance in the target variable explained by the model.
For example:

R² = 0.70 → 70% of the target’s variation is explained by the predictors.

How It’s Calculated

$R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$

Where:

SS_res = Sum of squared residuals (errors between actual & predicted)
SS_tot = Total sum of squares (squared deviations of actual values from their mean)

R² Interpretation Table

R² Value	Meaning
1	Perfect prediction
0.7	Explains 70% of variance
0	No predictive power
< 0	Worse than mean prediction

2. How R² Relates to VIF

The Variance Inflation Factor (VIF) uses R² behind the scenes to detect multicollinearity — when predictors are highly correlated with each other.

For each predictor, we run a regression of that predictor on all the other predictors.
We calculate R² for that regression.
VIF is then:

$\text{VIF} = \frac{1}{1 - R^2}$

High R² ⇒ High VIF ⇒ High multicollinearity

3. Step-by-Step VIF Example

Imagine we have three predictors:

Height	Weight	Leg_Length
160	60	80
170	70	85
180	80	90
175	75	88
165	65	83

Let’s calculate VIF for Weight.

Step 1: Regress “Weight” on the Other Predictors

We fit:

$\text{Weight} = a + b_1 \cdot \text{Height} + b_2 \cdot \text{Leg\_Length} + \text{error}$

Step 2: Calculate R²

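Suppose we get R² = 0.95, meaning Height and Leg_Length together explain 95% of Weight’s variance. (This figure is illustrative: in the toy table above Weight equals Height − 100 in every row, so the actual R² would be 1 and the VIF unbounded.)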

Step 3: Compute VIF

$\text{VIF} = \frac{1}{1 - 0.95} = \frac{1}{0.05} = 20$

Interpretation: VIF of 20 is extremely high — Weight is almost redundant given the other two predictors.

VIF Summary Table

Variable	R² with others	VIF	Multicollinearity?
Height	0.80	5	Moderate
Weight	0.95	20	Severe
Leg_Length	0.70	3.33	Low/Moderate
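
The VIF column of the summary table follows directly from the R² column. A tiny sketch of that arithmetic; the helper name `vif_from_r2` is illustrative:

```python
def vif_from_r2(r2):
    """Convert the R² of a predictor-on-other-predictors regression into a VIF."""
    if not 0 <= r2 < 1:
        raise ValueError("R² must be in [0, 1); R² = 1 means perfect collinearity")
    return 1.0 / (1.0 - r2)

r2_with_others = {"Height": 0.80, "Weight": 0.95, "Leg_Length": 0.70}
vif_table = {name: vif_from_r2(r2) for name, r2 in r2_with_others.items()}

for name, value in vif_table.items():
    print(f"{name:11s} VIF = {value:.2f}")
```

Going from R² = 0.80 to 0.95 looks like a modest step, yet it quadruples the VIF, because the denominator 1 − R² shrinks from 0.20 to 0.05.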

4. Python Example

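import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Example data (toy values; note that Weight = Height - 100 in every row)
df = pd.DataFrame({
    'Height': [160, 170, 180, 175, 165],
    'Weight': [60, 70, 80, 75, 65],
    'Leg_Length': [80, 85, 90, 88, 83]
})

# Add an intercept column: variance_inflation_factor does not add one itself,
# and without it the auxiliary regressions are forced through the origin.
X = add_constant(df[['Height', 'Weight', 'Leg_Length']])

# Calculate VIF for each real feature (index 0 is the 'const' column, so skip it)
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns[1:]
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

print(vif_data)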

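Output:

The printed values depend on the data. In this toy table Weight equals Height − 100 in every row, so those two columns are perfectly collinear and their VIFs come out extremely large (infinite, up to floating-point error, when an intercept is included), far above the illustrative figures in the summary table.

Height and Weight clearly have problematic multicollinearity.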

5. Key Takeaways

R²: Measures model fit — how much of the target’s variance is explained.
VIF: Uses R² to check feature redundancy.
High VIF (>10): Signals severe multicollinearity; consider removing or combining features.
