Understanding Variance Inflation Factor (VIF) — An Intuitive Guide
What is VIF?
The Variance Inflation Factor (VIF) measures how strongly a predictor variable can be explained by the other predictors in your dataset, and therefore how much the variance of its estimated coefficient is inflated by that overlap. It’s a key tool for detecting multicollinearity, a condition where predictors are highly correlated, potentially causing instability in regression models.
Why Multicollinearity Matters
When predictors overlap in the information they provide:
- The model struggles to determine which feature is truly influencing the target.
- Coefficient estimates can become unstable and unreliable.
- Interpretability suffers, making it harder to trust the model.
How VIF is Calculated (Intuitively)
1. Choose a predictor variable (e.g., X₁).
2. Regress X₁ against all the other predictors (X₂, X₃, …, Xₙ). Essentially: “Can the other features predict X₁?”
3. Calculate R², the proportion of variance in X₁ explained by the others.
4. Apply the formula:

VIF = 1 / (1 - R²)

- Low R² → denominator close to 1 → VIF ≈ 1 (low correlation).
- High R² → denominator small → VIF large (high correlation).
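These steps can be sketched directly in Python. A minimal illustration with NumPy (the `vif_for` helper name and the generated data are assumptions for demonstration, not library functions):

```python
import numpy as np

def vif_for(X, j):
    """VIF for column j of feature matrix X: regress X[:, j] on the
    remaining columns (plus an intercept) and apply 1 / (1 - R²)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so the auxiliary regression has a constant term
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

# Made-up data: x2 is nearly twice x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print(f"VIF(x1) = {vif_for(X, 0):.1f}, VIF(x3) = {vif_for(X, 2):.2f}")
```

Because x1 and x2 carry nearly the same information, their VIFs come out very large, while the independent x3 sits close to 1.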
How to Interpret VIF Values
| VIF Value | Meaning |
|---|---|
| 1 | No correlation with other features (ideal) |
| < 5 | Acceptable |
| 5–10 | Moderate to high correlation; monitor closely |
| > 10 | Severe multicollinearity; problematic |
An Intuitive Example
Imagine two features: height and leg length.
If leg length is almost always a fixed fraction of height:
- Regressing leg length on height would yield a very high R².
- The VIF for leg length would be large, signaling redundancy.
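This intuition is easy to check numerically. A small sketch (the data, the 0.48 ratio, and the noise level are made up for illustration; for a single-predictor regression, R² is just the squared correlation):

```python
import numpy as np

rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=500)
# Assumed relationship: leg length is roughly 48% of height plus small noise
leg_length = 0.48 * height + rng.normal(0, 1, size=500)

# With one predictor, R² equals the squared Pearson correlation
r2 = np.corrcoef(height, leg_length)[0, 1] ** 2
vif = 1 / (1 - r2)
print(f"R² = {r2:.3f}, VIF = {vif:.1f}")
```

The near-fixed ratio makes R² close to 1, so the VIF blows up, flagging leg length as redundant given height.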
Purpose of VIF in Modeling
- Identifies redundant predictors.
- Helps decide whether to drop or combine correlated features.
- Improves model stability and interpretability.
Key Takeaway
- The question VIF answers: “Can I predict this feature using the others?”
- High VIF: strong multicollinearity, so coefficient estimates become unstable.
- Low VIF: predictors are relatively independent, so coefficients are more stable and interpretable.
Understanding R² and VIF — From Model Fit to Multicollinearity
When building regression models, two concepts often come up together: R² (R-squared) and VIF (Variance Inflation Factor).
One measures how well your model fits the data, while the other checks for redundancy between predictors.
Let’s break them down intuitively and see how they connect.
1. What is R²?
R² — also known as the coefficient of determination — tells you how well your model’s predictions match the actual data.
- R² = 1 → perfect fit (model predictions match the data exactly)
- R² = 0 → the model explains none of the variation (as good as predicting the mean)
- R² < 0 → worse than just predicting the mean
What R² Really Measures
It represents the proportion of variance in the target variable explained by the model.
For example:
- R² = 0.70 → 70% of the target’s variation is explained by the predictors.
How It’s Calculated

R² = 1 - (SS_res / SS_tot)

Where:
- SS_res = sum of squared residuals (errors between actual and predicted values)
- SS_tot = total sum of squares (squared deviation of the actual values from their mean)
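In code, the same computation looks like this (the actual and predicted values are made up for illustration):

```python
import numpy as np

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.1])

ss_res = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"R² = {r2:.4f}")  # R² = 0.9925
```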
R² Interpretation Table
| R² Value | Meaning |
|---|---|
| 1 | Perfect prediction |
| 0.7 | Explains 70% of variance |
| 0 | No predictive power |
| < 0 | Worse than mean prediction |
2. How R² Relates to VIF
The Variance Inflation Factor (VIF) uses R² behind the scenes to detect multicollinearity — when predictors are highly correlated with each other.
- For each predictor, we run a regression of that predictor on all the other predictors.
- We calculate R² for that regression.
- VIF is then:

VIF = 1 / (1 - R²)

High R² ⇒ High VIF ⇒ High multicollinearity
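In code form, the mapping from an auxiliary-regression R² to a VIF is a one-liner (`vif_from_r2` is just an illustrative name), and it reproduces the thresholds from the interpretation table above:

```python
def vif_from_r2(r2: float) -> float:
    """Map an auxiliary-regression R² to its VIF."""
    return 1.0 / (1.0 - r2)

print(vif_from_r2(0.0))   # 1.0: no correlation with the other predictors
print(vif_from_r2(0.80))  # ~5: borderline
print(vif_from_r2(0.95))  # ~20: severe multicollinearity
```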
3. Step-by-Step VIF Example
Imagine we have three predictors:
| Height | Weight | Leg_Length |
|---|---|---|
| 160 | 60 | 80 |
| 170 | 70 | 85 |
| 180 | 80 | 90 |
| 175 | 75 | 88 |
| 165 | 65 | 83 |
Let’s calculate VIF for Weight.
Step 1: Regress “Weight” on the Other Predictors
We fit:
Weight = a + b1*Height + b2*Leg_Length + error
Step 2: Calculate R²
Suppose we get R² = 0.95 — meaning Height and Leg_Length together explain 95% of Weight’s variance.
Step 3: Compute VIF

VIF = 1 / (1 - 0.95) = 1 / 0.05 = 20

Interpretation: a VIF of 20 is extremely high, meaning Weight is almost redundant given the other two predictors.
VIF Summary Table
| Variable | R² with others | VIF | Multicollinearity? |
|---|---|---|---|
| Height | 0.80 | 5 | Moderate |
| Weight | 0.95 | 20 | Severe |
| Leg_Length | 0.70 | 3.33 | Low/Moderate |
4. Python Example
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Example data
df = pd.DataFrame({
    'Height': [160, 170, 180, 175, 165],
    'Weight': [60, 70, 80, 75, 65],
    'Leg_Length': [80, 85, 90, 88, 83]
})

# Calculate VIF for each feature
X = df[['Height', 'Weight', 'Leg_Length']]
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
```

Output (illustrative; exact values depend on the data):

```
      feature    VIF
0      Height   6.12
1      Weight  20.34
2  Leg_Length   4.78
```
- Weight clearly has problematic multicollinearity.
5. Key Takeaways
- R²: measures model fit, i.e., how much of the target’s variance is explained.
- VIF: uses R² to check feature redundancy.
- High VIF (> 10): signals severe multicollinearity; consider removing or combining features.