
Sunday, August 10, 2025

Regression evaluation techniques - credit limit assignment regression model

Scenario

You are building an XGBoost regression model that predicts “How much credit can be safely assigned to a loan requester” based on their application data, financial history, and behavioral data.
The target (y) is a continuous variable — the approved credit limit amount.


1. Accuracy Metrics & Intuition

Since this is regression, classification metrics like AUC or KS don’t apply.
Instead, we use error-based and rank-based metrics:

| Metric | Intuition | Example in Credit Limit Context |
|---|---|---|
| RMSE (Root Mean Squared Error) | Penalizes large mistakes more heavily. Good for spotting models that make occasional big blunders. | If RMSE = 3,000 and limits range 0–50k, large errors like predicting 50k for someone eligible for 10k happen occasionally. |
| MAE (Mean Absolute Error) | Average size of error regardless of direction. Easier to interpret than RMSE. | MAE = 2,000 means that on average, your credit limit predictions are off by $2,000. |
| MAPE (Mean Absolute Percentage Error) | Normalizes error relative to actual values. | If MAPE = 15%, predicting 8,500 for someone eligible for 10,000 is a 15% miss. |
| R² (Coefficient of Determination) | Measures how much variance in actual credit limits your model explains. | R² = 0.65 means the model explains 65% of why different customers get different limits. |
| Spearman correlation | Focuses on rank ordering — important when the exact limit isn’t as critical as ranking applicants from lowest to highest limit eligibility. | Spearman = 0.8 means applicants ranked high by the model generally do get higher limits. |
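All five metrics take only a few lines to compute. Below is a minimal sketch on made-up limits (scikit-learn and SciPy assumed available; the numbers are purely illustrative):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted credit limits (illustration only)
y_true = np.array([10_000, 25_000, 8_000, 40_000, 15_000])
y_pred = np.array([12_000, 23_000, 9_500, 35_000, 14_000])

rmse = mean_squared_error(y_true, y_pred) ** 0.5          # root of the MSE
mae = mean_absolute_error(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent
r2 = r2_score(y_true, y_pred)
rho, _ = spearmanr(y_true, y_pred)

print(f"RMSE={rmse:.0f} MAE={mae:.0f} MAPE={mape:.1f}% R2={r2:.2f} Spearman={rho:.2f}")
```

Note that these toy predictions preserve the applicants' rank order perfectly, so Spearman comes out at 1.0 even though every dollar prediction is off; that is exactly the distinction section 3 below draws.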

2. Typical Acceptable Thresholds for Model Approval

(These are common industry validation ranges — vary by portfolio type)

| Metric | Acceptable Range | Why it Matters Here |
|---|---|---|
| RMSE | ≤ 15–25% of credit limit range | Avoids big over-limit assignments that increase default risk. |
| MAE | ≤ 10–20% of limit range | Keeps average prediction error within acceptable business tolerance. |
| MAPE | ≤ 30–40% | Ensures proportional accuracy across low and high limits. |
| R² | ≥ 0.5 (≥ 0.4 for noisy small-business data) | Shows the model meaningfully explains applicant differences. |
| Spearman | ≥ 0.6–0.7 | Keeps rank ordering stable — critical for approval tiers. |

3. Why Rank Ordering Can Matter More Than Absolute Accuracy

  • In credit assignment, the exact dollar figure might be adjusted by policy rules after the model predicts.

  • What’s critical is ranking applicants correctly:

    • If the model thinks Applicant A should have higher limit than Applicant B, that ordering should hold true most of the time.

    • This is why Spearman correlation is often a regulatory requirement alongside RMSE.


4. Stability & Predictive Power Checks

For regression, we can still use:

| Check | Purpose | Regression Adjustment |
|---|---|---|
| PSI (Population Stability Index) | Ensures feature distributions don’t shift drastically between development and monitoring periods. | Use binned feature values, not target values. |
| CSI (Characteristic Stability Index) | Checks if the relationship between feature and target is stable. | For a continuous target, bin both feature and target and use the mean target per bin for the distribution. |
| Feature Importance | Identify which variables drive credit limits most. | Use gain/cover in XGBoost. |
| SHAP | Explain predictions to regulators and business. | Works identically for regression. |

5. Model Approval Reality

  • If your RMSE, MAE, MAPE, and Spearman are within target ranges, and PSI/CSI show stability, you’re in a strong position.

  • Even if one metric is slightly weak, a model can be approved if:

    • It beats the current champion model.

    • It’s more stable over time.

    • It’s more explainable and policy-compliant.

  • Failing both accuracy and stability → high risk of rejection.



IV, PSI, CSI - differences

Let’s frame this in a churn prediction context, because that’s a very common case where people see IV, PSI, and CSI all being used, notice that the formulas look similar, and get confused about why they’re treated differently.


1️⃣ The setting — churn prediction

  • Target: churn_flag (1 = churned, 0 = stayed).

  • Feature: avg_monthly_usage (average minutes per month).

  • Goal: Build a model that predicts churn, and also monitor if the feature is stable over time.

We have:

  • Train set → Customers from Jan–Mar 2025

  • OOT1 → Customers from Apr 2025

  • OOT2 → Customers from May 2025


2️⃣ The same base formula — different contexts

The mathematical core of IV, PSI, and CSI is a weighted log ratio:

\text{metric} = \sum_{\text{bins}} \left(\text{fraction}_1 - \text{fraction}_2\right) \times \log\left(\frac{\text{fraction}_1}{\text{fraction}_2}\right)

The difference is what those “fractions” mean and which datasets are compared.
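As a rough illustration, the shared core can be written in a few lines (the helper name is my own, not from any library; it assumes every bin fraction is strictly positive):

```python
import math

def weighted_log_ratio(p1, p2):
    """Sum over bins of (p1 - p2) * ln(p1 / p2) -- the core of IV, PSI and CSI.
    Assumes all bin fractions are > 0."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p1, p2))

print(weighted_log_ratio([0.5, 0.5], [0.5, 0.5]))            # identical -> 0.0
print(round(weighted_log_ratio([0.3, 0.7], [0.5, 0.5]), 3))  # shifted -> 0.169
```

Identical distributions always give 0; any divergence gives a positive value, and the larger the divergence the larger the metric.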


3️⃣ Information Value (IV)

  • Question: Does this feature separate churners from non-churners in a single dataset?

  • Fractions:

    • p_{stay,bin} = fraction of stayers in that bin (within the train set)

    • p_{churn,bin} = fraction of churners in that bin (within the train set)

  • Data involved: Only one dataset (e.g., Train).

  • Use: Feature selection — keep features with high IV (e.g., > 0.02).

  • Example:

    Train:
    Low usage: 80% churn, 20% stay
    High usage: 10% churn, 90% stay
    

    This produces a high IV → strong predictive power.
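A minimal IV computation, using hypothetical bin counts consistent with the percentages above (100 low-usage and 100 high-usage customers; the counts are my own illustration):

```python
import math

def information_value(stay_counts, churn_counts):
    """IV = sum over bins of (p_stay - p_churn) * ln(p_stay / p_churn)."""
    total_stay, total_churn = sum(stay_counts), sum(churn_counts)
    iv = 0.0
    for stay, churn in zip(stay_counts, churn_counts):
        p_stay, p_churn = stay / total_stay, churn / total_churn
        iv += (p_stay - p_churn) * math.log(p_stay / p_churn)
    return iv

# Hypothetical counts: low-usage bin -> 20 stay / 80 churn,
# high-usage bin -> 90 stay / 10 churn
iv = information_value(stay_counts=[20, 90], churn_counts=[80, 10])
print(round(iv, 2))  # far above typical "strong predictor" thresholds
```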


4️⃣ Population Stability Index (PSI)

  • Question: Has the overall feature distribution shifted over time? (no target involved)

  • Fractions:

    • p_{bin,train} = proportion of customers in that bin in Train (all customers, churned or not)

    • p_{bin,OOT} = proportion of customers in that bin in OOT (all customers, churned or not)

  • Data involved: Two datasets (e.g., Train vs OOT1).

  • Use: Detect population drift — if customers’ usage patterns shift, even if churn rate doesn’t change.

  • Example:

    Train:
    Low usage: 30% of all customers
    High usage: 70%
    
    OOT1:
    Low usage: 50% of all customers
    High usage: 50%
    

    PSI will be high → customer base composition shifted (maybe more low-usage customers now).


5️⃣ Characteristic Stability Index (CSI)

  • Question: Has the relationship between the feature and the target changed over time? (concept drift)

  • Fractions:

    • event_frac_{A,bin} = proportion of churners in Train that fall into that bin

    • event_frac_{B,bin} = proportion of churners in OOT that fall into that bin

  • Data involved: Two datasets (Train vs OOT1), target-specific.

  • Use: Detect changes in target–feature relationship.

  • Example:

    Train churners:
    Low usage: 70% of churners
    High usage: 30% of churners
    
    OOT1 churners:
    Low usage: 50% of churners
    High usage: 50%
    

    CSI will be high → churn pattern shifted; low usage no longer dominates churn.
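Plugging the toy bin shares from the PSI and CSI examples above into the shared formula (a sketch; by the common rule of thumb, values under 0.1 suggest stability and 0.1–0.25 a moderate shift):

```python
import math

# Bin shares (low usage, high usage) taken from the examples above
train_all, oot_all = [0.30, 0.70], [0.50, 0.50]        # all customers
train_churn, oot_churn = [0.70, 0.30], [0.50, 0.50]    # churners only

psi = sum((a - b) * math.log(a / b) for a, b in zip(train_all, oot_all))
csi = sum((a - b) * math.log(a / b) for a, b in zip(train_churn, oot_churn))
print(round(psi, 3), round(csi, 3))  # both land near 0.17, a moderate shift
```

Both come out the same here only because the toy distributions happen to mirror each other; on real data PSI and CSI move independently, which is exactly why both are monitored.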


6️⃣ Why they differ even if formula looks same

The formula structure is the same because all three are distribution comparison measures (based on KL divergence-like logic).
But the inputs differ:

  • IV → compares good vs bad within one dataset.

  • PSI → compares overall feature distribution across datasets.

  • CSI → compares event-specific feature distribution across datasets.

That’s why in churn:

  • A feature can have high IV, low PSI, low CSI → predictive and stable.

  • Or high IV, high PSI → predictive, but customer profile is shifting (risk for model drift).

  • Or high IV, high CSI → predictive in train, but churn relationship is changing (concept drift).



Saturday, August 9, 2025

Variance Inflation Factor (VIF)

Understanding Variance Inflation Factor (VIF) — An Intuitive Guide

What is VIF?

The Variance Inflation Factor (VIF) is a measure that indicates how much a predictor variable is correlated with other predictors in your dataset. It’s a key tool for detecting multicollinearity—a condition where predictors are highly correlated, potentially causing instability in regression models.


Why Multicollinearity Matters

When predictors overlap in the information they provide:

  • The model struggles to determine which feature is truly influencing the target.

  • Coefficient estimates can become unstable and unreliable.

  • Interpretability suffers, making it harder to trust the model.


How VIF is Calculated (Intuitively)

  1. Choose a predictor variable (e.g., X₁).

  2. Regress X₁ against all the other predictors (X₂, X₃, …, Xₙ).

    • Essentially: “Can the other features predict X₁?”

  3. Calculate R² — the proportion of variance in X₁ explained by the others.

  4. Apply the formula:

    VIF = 1 / (1 - R²)
    
    • Low R² → Denominator close to 1 → VIF ≈ 1 (low correlation).

    • High R² → Denominator small → VIF large (high correlation).
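The formula is simple enough to sanity-check directly (a quick sketch; the function name is my own):

```python
def vif_from_r2(r2):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing one
    predictor on all the other predictors."""
    return 1.0 / (1.0 - r2)

print(vif_from_r2(0.0))    # 1.0 -> no correlation with other features
print(vif_from_r2(0.80))   # ~5  -> borderline
print(vif_from_r2(0.95))   # ~20 -> severe multicollinearity
```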


How to Interpret VIF Values

| VIF Value | Meaning |
|---|---|
| 1 | No correlation with other features (ideal) |
| < 5 | Acceptable |
| 5–10 | Moderate to high correlation — monitor closely |
| > 10 | Severe multicollinearity — problematic |

An Intuitive Example

Imagine two features: height and leg length.
If leg length is almost always a fixed fraction of height:

  • Regressing leg length on height would yield a very high R².

  • The VIF for leg length would be large, signaling redundancy.


Purpose of VIF in Modeling

  • Identifies redundant predictors.

  • Helps decide whether to drop or combine correlated features.

  • Improves model stability and interpretability.


Key Takeaway

  • Question VIF answers: “Can I predict this feature using the others?”

  • High VIF: Strong multicollinearity → unstable estimates.

  • Low VIF: Predictors are relatively independent → better modeling performance.



Understanding R² and VIF — From Model Fit to Multicollinearity

When building regression models, two concepts often come up together: R² (R-squared) and VIF (Variance Inflation Factor).
One measures how well your model fits the data, while the other checks for redundancy between predictors.
Let’s break them down intuitively and see how they connect.


1. What is R²?

R² — also known as the coefficient of determination — tells you how well your model’s predictions match the actual data.

  • R² = 1 → Perfect fit (model predictions match data exactly)

  • R² = 0 → Model explains none of the variation (as good as predicting the mean)

  • R² < 0 → Worse than just predicting the mean

What R² Really Measures

It represents the proportion of variance in the target variable explained by the model.
For example:

  • R² = 0.70 → 70% of the target’s variation is explained by the predictors.

How It’s Calculated

R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}

Where:

  • SS_res = Sum of squared residuals (errors between actual & predicted)

  • SS_tot = Total sum of squares (variance of actual values from the mean)
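The formula can be checked by hand on tiny made-up numbers:

```python
# Hypothetical actuals and predictions
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

mean_y = sum(y) / len(y)
ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # squared residuals
ss_tot = sum((a - mean_y) ** 2 for a in y)            # variance around the mean
r2 = 1 - ss_res / ss_tot
print(r2)  # ~0.9625: the predictions explain ~96% of the variance
```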


R² Interpretation Table

| R² Value | Meaning |
|---|---|
| 1 | Perfect prediction |
| 0.7 | Explains 70% of variance |
| 0 | No predictive power |
| < 0 | Worse than mean prediction |

2. How R² Relates to VIF

The Variance Inflation Factor (VIF) uses R² behind the scenes to detect multicollinearity — when predictors are highly correlated with each other.

  • For each predictor, we run a regression of that predictor on all the other predictors.

  • We calculate R² for that regression.

  • VIF is then:

\text{VIF} = \frac{1}{1 - R^2}

High R² ⇒ High VIF ⇒ High multicollinearity


3. Step-by-Step VIF Example

Imagine we have three predictors:

| Height | Weight | Leg_Length |
|---|---|---|
| 160 | 60 | 80 |
| 170 | 70 | 85 |
| 180 | 80 | 90 |
| 175 | 75 | 88 |
| 165 | 65 | 83 |

Let’s calculate VIF for Weight.


Step 1: Regress “Weight” on the Other Predictors

We fit:

Weight = a + b1*Height + b2*Leg_Length + error


Step 2: Calculate R²

Suppose we get R² = 0.95 — meaning Height and Leg_Length together explain 95% of Weight’s variance.


Step 3: Compute VIF

\text{VIF} = \frac{1}{1 - 0.95} = \frac{1}{0.05} = 20

Interpretation: VIF of 20 is extremely high — Weight is almost redundant given the other two predictors.


VIF Summary Table

| Variable | R² with others | VIF | Multicollinearity? |
|---|---|---|---|
| Height | 0.80 | 5 | Moderate |
| Weight | 0.95 | 20 | Severe |
| Leg_Length | 0.70 | 3.33 | Low/Moderate |

4. Python Example

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Data as in the example above, with Weight slightly perturbed so the
# collinearity is strong but not exact (with the exact values, Weight
# equals Height - 100 and its VIF would be infinite)
df = pd.DataFrame({
    'Height': [160, 170, 180, 175, 165],
    'Weight': [60, 71, 80, 74, 66],
    'Leg_Length': [80, 85, 90, 88, 83]
})

# statsmodels' VIF assumes the design matrix includes an intercept;
# omitting the constant column distorts the values
X = add_constant(df[['Height', 'Weight', 'Leg_Length']])

vif_data = pd.DataFrame({
    'feature': X.columns[1:],  # skip the constant column
    'VIF': [variance_inflation_factor(X.values, i)
            for i in range(1, X.shape[1])]
})
print(vif_data)

Output: in this tiny, nearly collinear dataset all three VIFs come out far above 10.

  • Every feature here is almost perfectly predictable from the other two, which is clear, problematic multicollinearity; real datasets typically show a mix of low and high values, as in the summary table above.


5. Key Takeaways

  • R²: Measures model fit — how much of the target’s variance is explained.

  • VIF: Uses R² to check feature redundancy.

  • High VIF (>10): Signals severe multicollinearity; consider removing or combining features.



Feature selection techniques guide

Feature Selection Techniques for Regression & Classification

Feature selection techniques can be grouped into three main stages:

  1. Filter Methods (Before Modeling) – purely statistical or rule-based, no model needed.

  2. Embedded Methods (During Modeling) – selection happens while training.

  3. Wrapper Methods (After Modeling) – iterative, model-based evaluation.


1. Filter Methods — Pre-Model Selection

These rely on statistical tests and relationships between features and target.
They are fast, model-agnostic, and help remove irrelevant or redundant features early.

| Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
|---|---|---|---|---|
| Correlation Analysis | Regression & Classification | Measures linear relationship between features and target (e.g., Pearson, Spearman). | Simple, quick redundancy detection. | Only captures linear relationships, ignores interactions. |
| VIF (Variance Inflation Factor) | Regression | Detects multicollinearity in predictors to improve regression stability. | Identifies redundant predictors. | Only applies to linear regression; needs numerical/dummy-encoded data. |
| IV (Information Value) | Classification (binary) | Quantifies a variable’s ability to separate two classes. | Interpretable, great for credit scoring. | Binary classification only; needs binning for continuous data. |
| Chi-Square Test | Classification | Tests statistical dependence between categorical features and target. | Works well with categorical data. | Requires categorical variables; not for continuous targets. |
| ANOVA F-test | Regression & Classification | Tests if means of a numerical feature differ significantly across target groups. | Good for numerical vs categorical target relationships. | Assumes normally distributed data; no interactions. |
| Mutual Information | Regression & Classification | Measures general dependency (linear + nonlinear) between feature and target. | Captures non-linear relationships. | Computationally heavier than correlation. |
| Bivariate Analysis | Regression & Classification | Compares target statistics (mean, proportion) across feature bins/categories. | Easy visual interpretation. | Summarizes only one feature at a time; no interactions. |
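A quick sketch of two of these filters side by side on synthetic data (all names and numbers are invented). It shows the practical difference noted in the table: Pearson correlation misses a purely quadratic relationship that mutual information picks up.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_lin = rng.normal(size=1000)           # linearly related to the target
x_sq = rng.uniform(-3, 3, size=1000)    # nonlinearly (quadratically) related
y = 2 * x_lin + x_sq ** 2 + rng.normal(scale=0.1, size=1000)

X = np.column_stack([x_lin, x_sq])
pearson = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(2)]
mi = mutual_info_regression(X, y, random_state=0)

print("Pearson:", np.round(pearson, 2))  # x_sq looks nearly useless linearly
print("MI:", np.round(mi, 2))            # but carries strong signal
```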

2. Embedded Methods — Model-Integrated Selection

Feature selection happens during model training, influenced by the learning algorithm.
Best for keeping predictive and interpretable features.

| Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
|---|---|---|---|---|
| L1 Regularization (Lasso) | Regression & Classification | Shrinks less important feature coefficients to zero, effectively removing them. | Produces sparse, interpretable models. | May drop correlated useful features. |
| Elastic Net | Regression & Classification | Combines L1 (sparse) and L2 (stability) penalties for balanced selection. | Handles correlated features better than Lasso. | Needs hyperparameter tuning. |
| Tree-based Feature Importance | Regression & Classification | Measures how much each feature reduces impurity (Gini, MSE) in tree splits. | Works for non-linear, interaction-rich data. | Can be biased towards high-cardinality features. |
| SHAP Values | Regression & Classification | Model-agnostic method attributing contributions of features to predictions. | Explains both global and per-instance effects. | Computationally expensive for large datasets. |
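A hedged sketch of L1 selection on synthetic data (all names and numbers invented): with an L1 penalty, the coefficients of uninformative features are driven exactly to zero during training, which is the "selection happens while fitting" behavior described above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two columns actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)
print(np.round(model.coef_, 2))  # trailing coefficients shrink to ~0
```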

3. Wrapper Methods — Iterative Search

These methods repeatedly train and test models with different subsets of features to find the best combination.

| Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
|---|---|---|---|---|
| Recursive Feature Elimination (RFE) | Regression & Classification | Iteratively removes least important features until the desired number is reached. | Finds optimal subset for model performance. | Slow for large datasets; requires multiple model fits. |
| Sequential Feature Selection | Regression & Classification | Adds or removes features step-by-step based on model performance. | Simple and interpretable process. | Computationally expensive. |
| Permutation Importance | Regression & Classification | Measures drop in model performance when a feature’s values are shuffled. | Works for any model; easy to interpret. | Requires a trained model; may be unstable with correlated features. |
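For instance, RFE can recover the informative columns of a synthetic dataset (a sketch; scikit-learn's make_regression marks 3 of the 10 features as informative, and we ask RFE to keep exactly 3):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X, y)
print(rfe.support_)  # boolean mask: True for the 3 features RFE keeps
```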

4. Specialized / Dimensionality Reduction & Domain Knowledge

These are niche but powerful, especially for high-dimensional data.

| Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
|---|---|---|---|---|
| Principal Component Analysis (PCA) | Regression & Classification | Transforms correlated features into uncorrelated components. | Reduces dimensionality while keeping variance. | Components lose original feature meaning. |
| Domain Knowledge Filtering | Both | Remove irrelevant features based on business or scientific understanding. | Improves interpretability and model relevance. | Relies on expert input; risk of bias. |
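A small sketch of why PCA helps with correlated features (synthetic data, my own construction): three features built from one latent factor collapse onto essentially a single component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
# Three noisy copies of the same underlying signal
X = np.hstack([latent + rng.normal(scale=0.1, size=(300, 1)) for _ in range(3)])

pca = PCA(n_components=3).fit(X)
print(np.round(pca.explained_variance_ratio_, 3))  # first component dominates
```

The trade-off from the table is visible here too: the dominant component is a blend of all three inputs, so it no longer has a direct business meaning.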

Practical Usage Flow

  1. Initial Filtering → Correlation, IV, VIF, Chi-Square, Mutual Information, ANOVA.

  2. Model-Integrated Refinement → Lasso, Elastic Net, Tree-based Importance, SHAP.

  3. Performance Optimization → RFE, Sequential Selection, Permutation Importance.

  4. Special Cases → PCA for dimensionality reduction, domain filtering for expert-driven refinement.



Feature Selection techniques

SHAP, IV, VIF, Bivariate Analysis, Correlation & Feature Importance — A Complete Guide

When it comes to feature selection and interpretation in machine learning, there’s no shortage of tools. But knowing which method to use, when, and why can be confusing.

In this guide, we’ll break down six popular techniques — SHAP, Information Value (IV), Variance Inflation Factor (VIF), bivariate analysis, correlation, and feature importance — exploring their purpose, pros, cons, similarities, differences, and when to use them for numerical and categorical features.


1. SHAP (SHapley Additive exPlanations)

Purpose:
Explains individual predictions by calculating each feature’s contribution, inspired by cooperative game theory.

Why use it:

  • Works for any model — from decision trees to deep learning.

  • Offers both local (per observation) and global (overall) explanations.

  • Handles feature interactions.

  • Works with numerical and categorical features (native for trees, encoding needed for others).

Limitations:

  • Computationally heavy for large datasets.

  • Needs a fitted model.

  • Interpretation can be tricky at first.

Best for: Explaining complex, high-stakes models where transparency is key.


2. Information Value (IV)

Purpose:
Measures how well a variable separates two classes — ideal for binary classification problems.

Why use it:

  • Simple and easy to interpret.

  • Great for initial pre-model feature selection.

  • Doesn’t require a model.

Limitations:

  • Only works for binary targets.

  • Ignores interactions between features.

  • Continuous variables need binning.

Best for: Credit scoring, risk modeling, and other binary classification tasks.


3. Variance Inflation Factor (VIF)

Purpose:
Detects multicollinearity in regression by showing how much a variable is explained by other variables.

Why use it:

  • Highlights redundant predictors.

  • Improves regression stability and interpretability.

Limitations:

  • Only relevant for linear regression.

  • Requires numerical or dummy-encoded categorical variables.

  • Not helpful for tree-based models.

Best for: Preprocessing before running regression models.


4. Bivariate Analysis

Purpose:
Examines the relationship between one feature and the target — often through visual summaries like group means or bar plots.

Why use it:

  • Intuitive and visual.

  • Works for any feature type.

Limitations:

  • Only looks at one feature at a time.

  • Doesn’t provide a formal quantitative score.

Best for: Early exploratory data analysis (EDA) to spot obvious patterns.


5. Correlation

Purpose:
Measures linear association between two variables.

Why use it:

  • Quick, easy, and interpretable.

  • Useful for spotting redundancy.

Limitations:

  • Only captures linear relationships.

  • Pairwise only — misses more complex multicollinearity.

  • Sensitive to outliers.

Best for: Quick checks for related features before modeling.


6. Feature Importance

Purpose:
Shows how much each feature contributes to predictions in a trained model.

Why use it:

  • Model-driven insights.

  • Works for any model type.

  • Handles feature interactions.

Limitations:

  • Can be biased if features are correlated.

  • Requires a trained model.

  • May vary depending on algorithm.

Best for: Post-model analysis and refining models.


Comparison at a Glance

| Method | Purpose | Pros | Cons | Numerical | Categorical |
|---|---|---|---|---|---|
| SHAP | Explain predictions | Handles interactions | Slow, complex | Yes | Yes |
| IV | Pre-model selection | Simple, interpretable | Binary only, binning needed | Yes (bin) | Yes |
| VIF | Multicollinearity | Regression stability | Linear only | Yes | Yes (encode) |
| Bivariate Analysis | Relationship check | Visual, simple | No interactions | Yes (bin) | Yes |
| Correlation | Association check | Simple, fast | Linear only, pairwise | Yes | Yes (encode) |
| Feature Importance | Model-driven | Handles interactions | Needs model, bias possible | Yes | Yes |

Similarities & Differences

Similarities:

  • All assist with feature selection.

  • Most work with both numerical and categorical data (some need encoding).

  • Some methods are pre-model (IV, bivariate, correlation), others post-model (SHAP, feature importance).

Differences:

  • SHAP and feature importance require a trained model.

  • VIF and correlation both assess redundancy, but VIF considers all features together while correlation is pairwise.

  • IV works only for binary targets.


Key Takeaways

  • For early feature selection: Use IV, bivariate analysis, and correlation.

  • For redundancy checks in regression: Use VIF.

  • For interpreting model predictions: Use SHAP and feature importance.

  • Always remember: encoding matters for some methods, especially correlation and VIF.