Monday, August 11, 2025

Covariance



Significance of covariance & meaning of high vs low values

Covariance measures how two features vary together:

  • Positive covariance → When feature F_1 is above its mean, feature F_2 tends to also be above its mean. (They move in the same direction.)

  • Negative covariance → When F_1 is above its mean, F_2 tends to be below its mean. (They move in opposite directions.)

  • Near zero covariance → No consistent linear relationship: knowing one feature’s deviation from the mean tells you little about the other’s. (A nonlinear relationship can still exist.)

Numerically:

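  • Large magnitude (positive or negative) suggests a strong linear relationship, but the number depends on the features’ units, so compare magnitudes only between features on the same scale (or use correlation, which is covariance rescaled to the range −1 to 1).

  • Small magnitude means a weak linear relationship, or simply features measured in small units.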
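
A minimal NumPy sketch of the points above, using made-up income and spend figures (hypothetical data, not taken from these notes). It illustrates that the sign of the covariance gives the direction, that its magnitude changes with the units, and that correlation does not.

```python
import numpy as np

# Hypothetical toy data: monthly income and card spend, both in dollars.
income = np.array([3000.0, 4000.0, 5000.0, 6000.0, 7000.0])
spend = np.array([1200.0, 1500.0, 2100.0, 2300.0, 2900.0])

# Sample covariance is the off-diagonal entry of the 2x2 covariance matrix.
cov_dollars = np.cov(income, spend)[0, 1]

# Re-express income in thousands of dollars: same relationship, new units.
cov_thousands = np.cov(income / 1000.0, spend)[0, 1]

# Correlation rescales covariance by both standard deviations,
# so it does not depend on the units.
corr_dollars = np.corrcoef(income, spend)[0, 1]
corr_thousands = np.corrcoef(income / 1000.0, spend)[0, 1]

print(f"cov (income in $):  {cov_dollars:.1f}")
print(f"cov (income in $K): {cov_thousands:.4f}")
print(f"corr in both cases: {corr_dollars:.4f} vs {corr_thousands:.4f}")
```

The covariance shrinks by a factor of 1,000 when income is rescaled, while the correlation is unchanged.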


Why covariance matters in PCA

  • The covariance matrix encodes all pairwise relationships between features.

  • If features are highly correlated (large positive or negative covariance), PCA will combine them into a principal component that captures their shared variation, so you don’t have redundancy.

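  • If covariances are near zero, features are largely uncorrelated (which is weaker than independent); the principal components then mostly align with individual features, ordered by their variances, as long as those variances are clearly different.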


💡 Analogy
Think of the covariance matrix as a “map” of how all features move together.
The eigenvectors are “routes” through this map that maximize variance.
PCA rotates your view to look along those routes, and you project the original data (not the covariance matrix itself) into that rotated view.
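
The analogy can be sketched in a few lines of NumPy. This is an illustration on randomly generated data (the three features and their relationships are assumptions made for the example), not a replacement for a library PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly related features plus one unrelated feature.
f1 = rng.normal(size=500)
f2 = 0.9 * f1 + 0.1 * rng.normal(size=500)
f3 = rng.normal(size=500)
X = np.column_stack([f1, f2, f3])

# Center the data, then build the covariance matrix (features in columns).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the original (centered) data, not the covariance matrix,
# onto the principal directions.
scores = X_centered @ eigenvectors

explained = eigenvalues / eigenvalues.sum()
print("explained variance ratio:", np.round(explained, 3))
```

After the rotation, the covariance matrix of `scores` is diagonal: the new axes carry no shared variation, which is the redundancy removal described above.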


Sunday, August 10, 2025

How R2 is different from RMSE, MAE

 Let’s unpack this step by step, because R², RMSE, and MAE all involve “difference between actual and predicted,” but they measure different things and answer different questions.


1. R² (Coefficient of Determination): Variance Captured

  • Think of your target values (y_actual) as having some spread (variance) around their mean.

  • If you didn’t have a model and just guessed the mean for everyone, that’s your baseline.

  • R² asks:

    "How much better is my model compared to just guessing the mean every time?"


Formula

R^2 = 1 - \frac{\text{Sum of Squared Errors of Model}}{\text{Sum of Squared Errors of Mean Model}}

Where:

  • SSE_model = Σ(Actual − Predicted)²

  • SSE_mean = Σ(Actual − Mean)²


Intuition

  • R² = 1.0 → Model perfectly predicts all values (100% of variance explained).

  • R² = 0.0 → Model is no better than guessing the mean.

  • R² < 0.0 → Model is worse than guessing the mean (ouch).


Example in Credit Limit Prediction

Let’s say actual limits for 5 customers are:

Actual:  10k, 12k, 15k, 20k, 25k
Mean:    16.4k

Variance is the spread around 16.4k.

Case A: Terrible Model

Predicted: 16.4k for everyone (mean model) →
SSE_model = SSE_mean → R² = 0.

Case B: Decent Model

Predicted: 9k, 13k, 14k, 21k, 26k
SSE_model = 5 and SSE_mean = 149.2 (both in k²), so R² = 1 − 5/149.2 ≈ 0.97.
This means the model explains about 97% of the variation in limits between customers.
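
The Case B arithmetic can be checked directly. A small NumPy sketch using the five limits listed above (values in $K):

```python
import numpy as np

actual = np.array([10.0, 12.0, 15.0, 20.0, 25.0])     # in $K
predicted = np.array([9.0, 13.0, 14.0, 21.0, 26.0])   # Case B, in $K

# Squared error of the model vs squared error of always guessing the mean.
sse_model = np.sum((actual - predicted) ** 2)
sse_mean = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - sse_model / sse_mean

print(f"SSE_model = {sse_model:.1f}, SSE_mean = {sse_mean:.1f}, R2 = {r2:.3f}")
```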


2. RMSE & MAE: Error Magnitude

  • These do not compare to a baseline — they tell you how far off predictions are, on average.

  • RMSE penalizes large mistakes more heavily than MAE (because it squares the errors before averaging).

  • Both are absolute accuracy metrics, not relative to variance.


Example with Same Data

If predictions are:

Actual:    10k, 12k, 15k, 20k, 25k
Predicted: 9k, 13k, 14k, 21k, 26k

Absolute errors: 1k, 1k, 1k, 1k, 1k

  • MAE = (1k + 1k + 1k + 1k + 1k) / 5 = 1k

  • RMSE = sqrt((1² + 1² + 1² + 1² + 1²) / 5) = 1k

  • R² ≈ 0.97, which is very high because the 1k errors are small relative to the spread of the actual limits.


Key Difference

  • R²: “How much of the pattern in the data did I capture?”

  • RMSE / MAE: “How far off am I, in the actual unit (e.g., $)?”

You can have:

  • High R² but high RMSE → You’re good at ranking & trend, but still making large dollar errors.

  • Low R² but low RMSE → Everyone gets about the same prediction, close to average, but model doesn’t capture much variation between people.
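
Both combinations can be reproduced with two small made-up portfolios. The numbers below are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

# Portfolio 1: limits spread widely ($5K to $95K); predictions follow the
# trend but miss by $5K every time.
wide = np.linspace(5_000, 95_000, 10)
wide_pred = wide + np.array([5_000, -5_000] * 5)

# Portfolio 2: limits all close to $10K; the model predicts the mean for everyone.
narrow = np.array([9_800, 10_100, 9_900, 10_200, 10_000,
                   9_700, 10_300, 10_000, 9_900, 10_100], dtype=float)
narrow_pred = np.full_like(narrow, narrow.mean())

print(f"wide:   R2 = {r2(wide, wide_pred):.3f}, RMSE = ${rmse(wide, wide_pred):,.0f}")
print(f"narrow: R2 = {r2(narrow, narrow_pred):.3f}, RMSE = ${rmse(narrow, narrow_pred):,.0f}")
```

The first portfolio has a high R² with a $5,000 RMSE; the second has an R² of zero with an RMSE under $200.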



📘 Regression Model Evaluation: Credit Limit Assignment

🎯 The Scenario

You're building an XGBoost regression model that predicts:

"How much credit can we safely give this loan applicant?"

Input: Application data, financial history, behavior data

Output (y): A dollar amount (e.g., $10,000); this is continuous, not a yes/no

Since we're predicting a number (not a class), we can't use AUC, KS, or accuracy. We need regression-specific metrics.


1️⃣ The Core Metrics: How to Measure "How Wrong" the Model Is

Think of these as different ways to ask: "How far off were our predictions?"

📊 The Five Key Metrics (Simplified)

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Predicted: $8,000   Actual: $10,000   Error: -$2,000        │
│                                                              │
│  Different metrics measure this gap differently:             │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Metric | What It Tells You | Easy Example
RMSE | Punishes BIG mistakes more harshly | RMSE = $3,000 → occasional huge errors (like predicting $50K when answer is $10K)
MAE | Average error in dollars | MAE = $2,000 → typical prediction is $2K off
MAPE | Average error as a % | MAPE = 15% → predictions are typically 15% off
R² | % of variation the model explains | R² = 0.65 → model explains 65% of why limits differ
Spearman | How well rankings match (not exact $) | 0.8 → applicants ranked high by model usually do get high limits

🔍 Quick Visual: RMSE vs MAE

Actual limits:    $10K, $10K, $10K, $10K
Predictions:      $9K,  $9K,  $9K,  $50K   ← One big mistake!

MAE  = average of |errors|       = $10.75K (smoothed)
RMSE = sqrt(average of errors²)  ≈ $20K (BIG mistake stands out!)

→ Use RMSE when big errors are costly (like over-lending)
→ Use MAE when you want a fair average
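
The quick visual above can be verified with NumPy, using the same four limits and predictions (in $K):

```python
import numpy as np

actual = np.array([10.0, 10.0, 10.0, 10.0])     # in $K
predicted = np.array([9.0, 9.0, 9.0, 50.0])     # one $40K miss

errors = predicted - actual
mae = np.mean(np.abs(errors))            # (1 + 1 + 1 + 40) / 4
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((1 + 1 + 1 + 1600) / 4)

print(f"MAE  = ${mae:.2f}K")
print(f"RMSE = ${rmse:.2f}K")
```

RMSE comes out at roughly twice the MAE here, which is the signature of a single large miss.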

💡 Why Each Metric Matters in Credit Context

┌────────────────────────────────────────────────────────┐
│                                                        │
│  RMSE → "Did we make any DISASTROUS predictions?"      │
│         (Giving $50K to someone who can repay $10K)    │
│                                                        │
│  MAE  → "What's our TYPICAL miss in dollars?"          │
│         (Off by ~$2K on average)                       │
│                                                        │
│  MAPE → "How proportional is our error?"               │
│         ($1K miss on a $5K limit = bigger problem      │
│          than $1K miss on a $50K limit)                │
│                                                        │
│  R²   → "Is our model better than just guessing        │
│          the average?"                                 │
│                                                        │
│  Spearman → "Do we at least rank people correctly?"    │
│             (Even if dollars are slightly off)         │
│                                                        │
└────────────────────────────────────────────────────────┘

2️⃣ What's "Good Enough"? Industry Thresholds

These are rough rules of thumb (they vary by business, but are useful as guidelines):

Metric | Acceptable Range | Why It Matters
RMSE | ≤ 15–25% of credit range | Prevents catastrophic over-lending
MAE | ≤ 10–20% of credit range | Keeps typical errors within tolerance
MAPE | ≤ 30–40% | Ensures fairness across small & large limits
R² | ≥ 0.5 (≥ 0.4 for noisy data) | Model meaningfully explains differences
Spearman | ≥ 0.6–0.7 | Critical for ranking applicants into tiers

📌 Example Sanity Check

Credit limits range from $0 to $50,000

✅ RMSE = $3,000   →  6% of range  → GOOD
✅ MAE  = $2,000   →  4% of range  → GREAT
⚠️ MAPE = 35%      →  Borderline   → OK
✅ R²   = 0.65     →  65% explained → GOOD  
✅ Spearman = 0.75 →  Rankings solid → GOOD

Model status: APPROVED ✅

3️⃣ Why Rank Ordering Often Matters MORE Than Exact $$

Here's a crucial insight that surprises many people:

In credit risk, the EXACT predicted dollar amount usually gets adjusted by business rules anyway. What matters most is the RANKING — does the model correctly order applicants from low-risk to high-risk?

🎯 Visual Example

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Applicants and Model Predictions:                       │
│                                                          │
│  Person A: Model predicts $8K, gets approved for $7K    │
│  Person B: Model predicts $15K, gets approved for $14K  │
│  Person C: Model predicts $25K, gets approved for $20K  │
│                                                          │
│  ✅ Exact $ may differ from approvals                    │
│  ✅ BUT the ORDER (A < B < C) is correct                 │
│  ✅ Business policy makes final $ adjustments            │
│                                                          │
│  → Spearman rank correlation captures this perfectly     │
│                                                          │
└──────────────────────────────────────────────────────────┘
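
A short check of the example in the box, assuming SciPy is available. `spearmanr` compares only the ranks, so the dollar gaps between prediction and approval do not matter.

```python
from scipy.stats import spearmanr

predicted = [8_000, 15_000, 25_000]   # model output for persons A, B, C
approved = [7_000, 14_000, 20_000]    # final approved limits for A, B, C

# Spearman's rho is 1.0 whenever both lists put the applicants in the same order.
rho, p_value = spearmanr(predicted, approved)
print(f"Spearman rho = {rho:.2f}")
```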

Why This Matters for Regulators

  • Approval tiers (Tier 1, 2, 3) depend on relative ranking
  • Evidence of rank ordering (for example, Spearman correlation) is commonly expected in model validation and governance reviews
  • A model with slightly worse RMSE but better Spearman is often preferred ✅

4️⃣ Stability Checks: Does the Model Hold Up Over Time?

Accuracy on test data isn't enough. You need to know:

"Will this model still work in 6 months when customer behavior shifts?"

Two Critical Stability Tests

🔄 PSI (Population Stability Index)

What it checks: Are your input features behaving similarly now vs when you trained?

Training time:     Today (6 months later):
                                          
Income distribution:    Income distribution:
$30K-$50K: 40%         $30K-$50K: 25%  ← SHIFTED!
$50K-$80K: 40%         $50K-$80K: 35%
$80K+:     20%         $80K+:     40%

PSI ≈ 0.22 → WARNING (upper end of the 0.10–0.25 band)
Population has changed → model may fail

PSI Score | Interpretation
< 0.10 | ✅ Stable: no action needed
0.10 – 0.25 | ⚠️ Slight shift: monitor
> 0.25 | 🚨 Major shift: retrain model

For regression: Apply PSI on binned features, not on the target value.
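
A sketch of how PSI might be computed for a continuous feature such as income. The quantile binning, the number of bins, the `eps` floor, and the simulated income samples are all assumptions made for illustration.

```python
import numpy as np

def psi_from_samples(train_values, current_values, n_bins=10, eps=1e-6):
    """PSI of one feature, with bin edges taken from training quantiles."""
    # Interior cut points; values below the first or above the last
    # fall into the two outer bins.
    interior = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))[1:-1]

    train_idx = np.searchsorted(interior, train_values, side="right")
    current_idx = np.searchsorted(interior, current_values, side="right")

    train_frac = np.bincount(train_idx, minlength=n_bins) / len(train_values)
    current_frac = np.bincount(current_idx, minlength=n_bins) / len(current_values)

    # Floor the proportions so an empty bin does not produce log(0).
    train_frac = np.clip(train_frac, eps, None)
    current_frac = np.clip(current_frac, eps, None)
    return float(np.sum((current_frac - train_frac) * np.log(current_frac / train_frac)))

rng = np.random.default_rng(42)
train_income = rng.normal(60_000, 15_000, 5_000)     # hypothetical training sample
same_income = rng.normal(60_000, 15_000, 5_000)      # same population, new draw
shifted_income = rng.normal(75_000, 15_000, 5_000)   # population moved up by $15K

psi_same = psi_from_samples(train_income, same_income)
psi_shifted = psi_from_samples(train_income, shifted_income)
print(f"PSI, same population:    {psi_same:.3f}")
print(f"PSI, shifted population: {psi_shifted:.3f}")
```

Taking the bin edges from the training data keeps the reference distribution roughly uniform across bins, so any drift shows up as bins gaining or losing share.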

🎯 CSI (Characteristic Stability Index)

What it checks: Is the relationship between a feature and the target stable?

Example: "Does income still predict credit limit the same way?"

Training: Income $50K → avg limit $15K
Today:    Income $50K → avg limit $12K  ← Relationship shifted!

For regression:

  1. Bin the feature into groups
  2. Calculate mean target per bin (training vs production)
  3. Compare distributions
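
The three steps above could look like this in pandas. The data generator, the five quantile bins, and the per-bin percentage change used for the comparison are assumptions for illustration; the notes do not prescribe a specific comparison statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def make_data(n, slope):
    # Hypothetical generator: credit limit rises with income at the given slope.
    income = rng.uniform(30_000, 100_000, n)
    limit = slope * income + rng.normal(0, 1_000, n)
    return pd.DataFrame({"income": income, "limit": limit})

train = make_data(5_000, slope=0.30)
prod = make_data(5_000, slope=0.24)   # relationship weakened in production

# Step 1: bin the feature, using edges learned on the training data.
edges = np.quantile(train["income"], np.linspace(0, 1, 6))
edges[0], edges[-1] = -np.inf, np.inf
train["bin"] = pd.cut(train["income"], bins=edges, labels=False)
prod["bin"] = pd.cut(prod["income"], bins=edges, labels=False)

# Step 2: mean target per bin, training vs production.
comparison = pd.DataFrame({
    "train_mean": train.groupby("bin")["limit"].mean(),
    "prod_mean": prod.groupby("bin")["limit"].mean(),
})

# Step 3: compare, here as the relative change in mean limit per bin.
comparison["pct_change"] = comparison["prod_mean"] / comparison["train_mean"] - 1.0
print(comparison.round(3))
```

In this simulated case every income bin shows a mean limit about 20% lower in production, even though the income distribution itself has not moved.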

🛠️ Plus the Standard Diagnostics

Tool | Purpose
Feature Importance (XGBoost gain/cover) | Which features drive predictions?
SHAP values | Explain individual predictions to regulators

5️⃣ The Final Approval Decision

✅ Strong Approval Case

┌─────────────────────────────────────────────┐
│  ✅ RMSE/MAE/MAPE within targets             │
│  ✅ R² ≥ 0.5                                 │
│  ✅ Spearman ≥ 0.65                          │
│  ✅ PSI < 0.10 (stable population)           │
│  ✅ CSI < 0.10 (stable relationships)        │
│                                             │
│  → APPROVED with confidence ✅               │
└─────────────────────────────────────────────┘

⚠️ Borderline Cases (Still Approvable)

A model can be approved even if one metric is weak, IF:

  • ✅ It beats the current champion model
  • ✅ It's more stable over time
  • ✅ It's more explainable for regulators
  • ✅ It's policy-compliant

🚫 Likely Rejection

❌ Multiple accuracy metrics fail
❌ AND stability checks fail
→ High risk → REJECT or rebuild

🎯 The "Remember Forever" Cheat Sheet

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  📌 ACCURACY METRICS (How wrong are predictions?)        │
│     • RMSE → punishes big errors                         │
│     • MAE  → average error in $                          │
│     • MAPE → average error in %                          │
│     • R²   → % of variance explained                     │
│     • Spearman → ranking accuracy                        │
│                                                          │
│  📌 STABILITY METRICS (Will model still work later?)     │
│     • PSI → are features stable over time?               │
│     • CSI → are feature-target relationships stable?     │
│                                                          │
│  📌 EXPLAINABILITY (Can we justify to regulators?)       │
│     • Feature Importance → which features matter?        │
│     • SHAP → why this specific prediction?               │
│                                                          │
│  📌 KEY INSIGHT                                          │
│     Ranking > Exact $ Value                              │
│     (Policy rules adjust amounts anyway)                 │
│                                                          │
└──────────────────────────────────────────────────────────┘

💻 Quick Code Reference

regression_evaluation.py

Sample Output:

RMSE:     $3,000
MAE:      $2,000
MAPE:     15%
R²:       0.65
Spearman: 0.75
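
A minimal sketch of how the five metrics could be computed with NumPy and SciPy. The function name `evaluate_regression` and the sample arrays are hypothetical, and the numbers it prints are for those arrays, not the sample output shown above.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_regression(y_true, y_pred):
    """Return RMSE, MAE, MAPE (%), R2 and Spearman rho as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    # MAPE is undefined when an actual value is 0; this sketch assumes none are.
    mape = np.mean(np.abs(errors / y_true)) * 100.0
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    rho, _ = spearmanr(y_true, y_pred)

    return {"RMSE": rmse, "MAE": mae, "MAPE_%": mape, "R2": r2, "Spearman": rho}

# Hypothetical credit limits in dollars, only to exercise the function.
y_true = [10_000, 12_000, 15_000, 20_000, 25_000]
y_pred = [9_000, 13_000, 14_000, 21_000, 26_000]

metrics = evaluate_regression(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:>8}: {value:,.3f}")
```

In practice `y_pred` would come from the fitted model's predictions on a held-out set.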


Evaluation techniques for regression

Let’s make this concrete with credit engine–style examples so the intuition clicks.

Imagine we’re predicting Loss Given Default (LGD) for loans.
We have actual LGD values from historical defaults and model predictions.


1. RMSE – Root Mean Squared Error

Example:

  • Actual LGD: [0.10, 0.30, 0.90]

  • Predicted LGD: [0.12, 0.50, 0.30]

Errors: [0.02, 0.20, -0.60] → squared: [0.0004, 0.04, 0.36] → mean: 0.1335 → sqrt: 0.365

Interpretation:

  • The 0.60 miss on the last loan blows up the RMSE because squaring makes big mistakes shout louder.

  • RMSE here says: “Your typical big-mistake-weighted error is 36.5 percentage points.”


2. MAE – Mean Absolute Error

Same example:
Absolute errors: [0.02, 0.20, 0.60] → mean: 0.273

Interpretation:

  • MAE says: “On average, you’re off by 27.3 percentage points.”

  • It weights every miss in proportion to its size: the 0.60 miss counts 30 times as much as the 0.02 miss, with no extra penalty for being large.


3. MAPE – Mean Absolute Percentage Error

Example:
Absolute % errors:
[0.02/0.10 = 20%, 0.20/0.30 ≈ 66.7%, 0.60/0.90 ≈ 66.7%] → mean ≈ 51.1%

Interpretation:

  • “On average, you’re off by 51% of the actual LGD value.”

  • If an actual LGD is close to zero (e.g., 0.01) and you predict 0.10, MAPE goes crazy.
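
To see how sharply a near-zero actual value distorts MAPE, the sketch below adds the loan from the last bullet (actual 0.01, predicted 0.10) to the three-loan example.

```python
import numpy as np

def mape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((predicted - actual) / actual)) * 100.0)

base_actual = [0.10, 0.30, 0.90]
base_pred = [0.12, 0.50, 0.30]
mape_base = mape(base_actual, base_pred)

# One extra loan with a tiny actual LGD and a 0.09 absolute miss (a 900% error).
mape_with_tiny = mape(base_actual + [0.01], base_pred + [0.10])

print(f"MAPE, three loans:         {mape_base:.1f}%")
print(f"MAPE, plus near-zero loan: {mape_with_tiny:.1f}%")
```

A single 0.09 miss, small in absolute terms, moves MAPE from about 51% to about 263%.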


4. R² – Coefficient of Determination

Example:

  • If actual LGDs vary a lot, and your predictions capture that variation well, R² will be high.

  • If you just predict the average LGD for everyone, R² will be 0 on that data (and close to 0 on new data): you didn’t explain any variance.

Interpretation:

  • R² answers: “How much of the LGD variability did I explain compared to just guessing the mean?”


5. Pearson vs. Spearman Correlation

Example:

  • Suppose actual LGD ranking (highest to lowest risk): Loan C, Loan B, Loan A.

  • Predictions: Loan C is still ranked highest, then A, then B.

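Pearson: Could be low if the predicted values do not track the actuals linearly (for example, one large under-prediction).
Spearman: Depends only on the order. With one adjacent swap among three loans it is 0.5; across a larger portfolio, a few local swaps like this would still leave it high.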

Interpretation:

  • Pearson: “Do the numbers line up in a straight-line way?”

  • Spearman: “Even if I got magnitudes wrong, did I keep the order right?”
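
As an illustration, both correlations can be computed for the three-loan LGD example from the RMSE section (reused here as an assumption; the ranking example above is described only in words). SciPy is assumed to be available.

```python
from scipy.stats import pearsonr, spearmanr

actual = [0.10, 0.30, 0.90]      # actual LGD for three loans
predicted = [0.12, 0.50, 0.30]   # the riskiest loan is badly under-predicted

# Pearson works on the values themselves; Spearman works on their ranks.
pearson_r, _ = pearsonr(actual, predicted)
spearman_rho, _ = spearmanr(actual, predicted)

print(f"Pearson  r   = {pearson_r:.2f}")
print(f"Spearman rho = {spearman_rho:.2f}")
```

The large miss on the riskiest loan pulls Pearson down to about 0.21, while Spearman stays at 0.5 because only one adjacent pair of ranks is swapped.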


Summary in credit engine terms:

  • RMSE → penalizes big LGD prediction errors heavily (good if big misses are expensive for the bank).

  • MAE → gives equal weight to all misses, good for stable reporting.

  • MAPE → interprets error in % terms, useful if LGD has consistent scale across products.

  • R² → tells if your model adds value beyond a dumb constant guess.

  • Spearman → good for prioritization tasks (e.g., which borrowers to monitor first).



IV, PSI, CSI - differences

Let’s frame this in a churn prediction context. It’s a common case where IV, PSI, and CSI are all used, the formulas look similar, and people get confused about why the three are treated differently.


1️⃣ The setting — churn prediction

  • Target: churn_flag (1 = churned, 0 = stayed).

  • Feature: avg_monthly_usage (average minutes per month).

  • Goal: Build a model that predicts churn, and also monitor if the feature is stable over time.

We have:

  • Train set → Customers from Jan–Mar 2025

  • OOT1 → Customers from Apr 2025

  • OOT2 → Customers from May 2025


2️⃣ The same base formula — different contexts

The mathematical core of IV, PSI, and CSI is a weighted log ratio:

\text{metric} = \sum (\text{fraction diff}) \times \log \left( \frac{\text{fraction 1}}{\text{fraction 2}} \right)

The difference is what those “fractions” mean and which datasets are compared.


3️⃣ Information Value (IV)

  • Question: Does this feature separate churners from non-churners in a single dataset?

  • Fractions:

    • p_{\text{stay, bin}} = fraction of all stayers that fall in that bin (within train set)

    • p_{\text{churn, bin}} = fraction of all churners that fall in that bin (within train set)

  • Data involved: Only one dataset (e.g., Train).

  • Use: Feature selection. A common rule of thumb drops features with IV below about 0.02 and treats values above about 0.3 as strong.

  • Example:

    Train:
    Low usage: 80% churn rate, 20% stay
    High usage: 10% churn rate, 90% stay
    

    Churners therefore concentrate in the low-usage bin and stayers in the high-usage bin, which produces a high IV → strong predictive power.
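
A sketch of the IV calculation for this example. IV needs the share of all churners and of all stayers in each bin, so the code assumes 1,000 customers in each usage bin (a hypothetical count; the notes give only the rates).

```python
import numpy as np

# Hypothetical counts consistent with the churn rates above,
# assuming 1,000 customers in each usage bin.
churn = np.array([800.0, 100.0])   # low usage, high usage
stay = np.array([200.0, 900.0])

# Distribution of each class across the bins (each sums to 1).
p_churn = churn / churn.sum()
p_stay = stay / stay.sum()

# Weight of evidence per bin, then IV as the weighted sum.
woe = np.log(p_stay / p_churn)
iv = float(np.sum((p_stay - p_churn) * woe))

print(f"IV = {iv:.2f}")
```

Under that assumption the IV is about 2.5, far above the usual cut-offs.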


4️⃣ Population Stability Index (PSI)

  • Question: Has the overall feature distribution shifted over time? (no target involved)

  • Fractions:

    • p_{\text{bin, train}} = proportion of customers in that bin in Train (all customers, churned or not)

    • p_{\text{bin, OOT}} = proportion of customers in that bin in OOT (all customers, churned or not)

  • Data involved: Two datasets (e.g., Train vs OOT1).

  • Use: Detect population drift — if customers’ usage patterns shift, even if churn rate doesn’t change.

  • Example:

    Train:
    Low usage: 30% of all customers
    High usage: 70%
    
    OOT1:
    Low usage: 50% of all customers
    High usage: 50%
    

    PSI ≈ 0.17 here, which falls in the 0.10–0.25 “monitor” band → customer base composition shifted (maybe more low-usage customers now).


5️⃣ Characteristic Stability Index (CSI)

  • Question: Has the relationship between the feature and the target changed over time? (concept drift)

  • Fractions:

    • \text{event\_frac}_{A, \text{bin}} = proportion of churners in Train that fall into that bin

    • \text{event\_frac}_{B, \text{bin}} = proportion of churners in OOT that fall into that bin

  • Data involved: Two datasets (Train vs OOT1), target-specific.

  • Use: Detect changes in target–feature relationship.

  • Example:

    Train churners:
    Low usage: 70% of churners
    High usage: 30% of churners
    
    OOT1 churners:
    Low usage: 50% of churners
    High usage: 50%
    

    CSI ≈ 0.17 here → churn pattern shifted; low usage no longer dominates churn.


6️⃣ Why they differ even if formula looks same

The formula structure is the same because all three are distribution comparison measures (each is a symmetrized KL divergence between two binned distributions).
But the inputs differ:

  • IV → compares good vs bad within one dataset.

  • PSI → compares overall feature distribution across datasets.

  • CSI → compares event-specific feature distribution across datasets.

That’s why in churn:

  • A feature can have high IV, low PSI, low CSI → predictive and stable.

  • Or high IV, high PSI → predictive, but customer profile is shifting (risk for model drift).

  • Or high IV, high CSI → predictive in train, but churn relationship is changing (concept drift).
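
The "same formula, different inputs" point can be made concrete with one helper applied to the PSI and CSI examples above. The name `log_ratio_index` is hypothetical, and CSI follows the definition used in these notes.

```python
import numpy as np

def log_ratio_index(p, q):
    """Sum over bins of (p - q) * ln(p / q); p and q are proportions summing to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))

# PSI: share of ALL customers per usage bin (low, high), Train vs OOT1.
psi = log_ratio_index([0.30, 0.70], [0.50, 0.50])

# CSI (as defined in these notes): share of CHURNERS per usage bin, Train vs OOT1.
csi = log_ratio_index([0.70, 0.30], [0.50, 0.50])

print(f"PSI = {psi:.3f}")
print(f"CSI = {csi:.3f}")
```

Both come out at about 0.17 because the two examples happen to use mirrored proportions; what differs is which population the proportions describe.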



Saturday, August 9, 2025

Variance Inflation Factor (VIF)

Understanding Variance Inflation Factor (VIF) — An Intuitive Guide

What is VIF?

The Variance Inflation Factor (VIF) measures how much the variance of a predictor’s estimated coefficient is inflated because that predictor is correlated with the other predictors in your dataset. It’s a key tool for detecting multicollinearity, a condition where predictors are highly correlated, potentially causing instability in regression models.


Why Multicollinearity Matters

When predictors overlap in the information they provide:

  • The model struggles to determine which feature is truly influencing the target.

  • Coefficient estimates can become unstable and unreliable.

  • Interpretability suffers, making it harder to trust the model.


How VIF is Calculated (Intuitively)

  1. Choose a predictor variable (e.g., X₁).

  2. Regress X₁ against all the other predictors (X₂, X₃, …, Xₙ).

    • Essentially: “Can the other features predict X₁?”

  3. Calculate R² — the proportion of variance in X₁ explained by the others.

  4. Apply the formula:

    VIF = 1 / (1 - R²)
    
    • Low R² → Denominator close to 1 → VIF ≈ 1 (low correlation).

    • High R² → Denominator small → VIF large (high correlation).
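
The four steps can be written out by hand with NumPy least squares. The helper name `vif` and the simulated features are assumptions for illustration.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])

    # Steps 2 and 3: fit the regression and compute its R2.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

    # Step 4: apply the formula.
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
# Hypothetical features: x2 is mostly a copy of x1, x3 is unrelated to both.
x1 = rng.normal(size=1_000)
x2 = x1 + 0.3 * rng.normal(size=1_000)
x3 = rng.normal(size=1_000)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print("VIFs:", np.round(vifs, 2))
```

The two overlapping features get VIFs well above 5, while the unrelated one stays near 1.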


How to Interpret VIF Values

VIF Value | Meaning
1 | No correlation with other features (ideal)
< 5 | Acceptable
5–10 | Moderate to high correlation: monitor closely
> 10 | Severe multicollinearity: problematic

An Intuitive Example

Imagine two features: height and leg length.
If leg length is almost always a fixed fraction of height:

  • Regressing leg length on height would yield a very high R².

  • The VIF for leg length would be large, signaling redundancy.


Purpose of VIF in Modeling

  • Identifies redundant predictors.

  • Helps decide whether to drop or combine correlated features.

  • Improves model stability and interpretability.


Key Takeaway

  • Question VIF answers: “Can I predict this feature using the others?”

  • High VIF: Strong multicollinearity → unstable estimates.

  • Low VIF: Predictors are relatively independent → more stable, more interpretable coefficient estimates.



Understanding R² and VIF — From Model Fit to Multicollinearity

When building regression models, two concepts often come up together: R² (R-squared) and VIF (Variance Inflation Factor).
One measures how well your model fits the data, while the other checks for redundancy between predictors.
Let’s break them down intuitively and see how they connect.


1. What is R²?

R², also known as the coefficient of determination, tells you how well your model’s predictions match the actual data.

  • R² = 1 → Perfect fit (model predictions match data exactly)

  • R² = 0 → Model explains none of the variation (as good as predicting the mean)

  • R² < 0 → Worse than just predicting the mean

What R² Really Measures

It represents the proportion of variance in the target variable explained by the model.
For example:

  • R² = 0.70 → 70% of the target’s variation is explained by the predictors.

How It’s Calculated

R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}

Where:

  • SS_res = Sum of squared residuals (errors between actual & predicted)

  • SS_tot = Total sum of squares (squared deviations of actual values from their mean)


R² Interpretation Table

R² Value | Meaning
1 | Perfect prediction
0.7 | Explains 70% of variance
0 | No predictive power
< 0 | Worse than mean prediction

2. How R² Relates to VIF

The Variance Inflation Factor (VIF) uses R² behind the scenes to detect multicollinearity — when predictors are highly correlated with each other.

  • For each predictor, we run a regression of that predictor on all the other predictors.

  • We calculate R² for that regression.

  • VIF is then:

\text{VIF} = \frac{1}{1 - R^2}

High R² ⇒ High VIF ⇒ High multicollinearity


3. Step-by-Step VIF Example

Imagine we have three predictors:

Height Weight Leg_Length
160 60 80
170 70 85
180 80 90
175 75 88
165 65 83

Let’s calculate VIF for Weight.


Step 1: Regress “Weight” on the Other Predictors

We fit:

Weight = a + b1*Height + b2*Leg_Length + error


Step 2: Calculate R²

Suppose we get R² = 0.95 (an illustrative value, not one fitted to the five rows above), meaning Height and Leg_Length together explain 95% of Weight’s variance.


Step 3: Compute VIF

\text{VIF} = \frac{1}{1 - 0.95} = \frac{1}{0.05} = 20

Interpretation: VIF of 20 is extremely high — Weight is almost redundant given the other two predictors.


VIF Summary Table

Variable | R² with others | VIF | Multicollinearity?
Height | 0.80 | 5 | Moderate
Weight | 0.95 | 20 | Severe
Leg_Length | 0.70 | 3.33 | Low/Moderate

4. Python Example

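import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Example data (note: Weight equals Height - 100 in every row)
df = pd.DataFrame({
    'Height': [160, 170, 180, 175, 165],
    'Weight': [60, 70, 80, 75, 65],
    'Leg_Length': [80, 85, 90, 88, 83]
})

# Add an intercept column; without it statsmodels reports uncentered VIFs
X = add_constant(df[['Height', 'Weight', 'Leg_Length']])

# Compute the VIF of each predictor, skipping the intercept column itself
features = [col for col in X.columns if col != 'const']
vif_data = pd.DataFrame({
    'feature': features,
    'VIF': [variance_inflation_factor(X.values, X.columns.get_loc(col)) for col in features]
})

print(vif_data)

Output (approximate; the extreme values depend on floating-point rounding):

  • Height and Weight: VIF is infinite or astronomically large, because Weight = Height − 100 exactly in this toy data (R² = 1).

  • Leg_Length: VIF ≈ 209, because Height alone explains about 99.5% of its variance.

  • All three predictors show severe multicollinearity in this sample.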


5. Key Takeaways

  • R²: Measures model fit, i.e., how much of the target’s variance is explained.

  • VIF: Uses R² to check feature redundancy.

  • High VIF (>10): Signals severe multicollinearity; consider removing or combining features.