L1 vs L2 Regularization — The Complete Guide (with Elastic Net)
When building machine learning models, it’s easy to fall into the overfitting trap — where your model learns noise instead of real patterns.
Regularization is one of the best ways to fight this.
Two of the most widely used regularization techniques are:
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
Both add a penalty term to the loss function, discouraging overly complex models. Let’s break them down.
1. L1 Regularization (Lasso)
Definition:
Adds the absolute value of the weights as a penalty term to the loss function:

\( \text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_i |w_i| \)

Where:
- \( w_i \) = weight of the i-th feature
- \( \lambda \) = regularization strength (higher = more penalty)
Key Characteristics:
- Encourages sparsity (many weights become exactly zero)
- Naturally performs feature selection
- Works best when only a subset of features is truly relevant
When to Use:
- High-dimensional datasets (e.g., text classification, genetics)
- When you expect many features to be irrelevant
Example:
Predicting house prices with 100 features → L1 might keep only the 10 most important ones (e.g., square footage, location) and set the rest to zero.
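A minimal sketch of that sparsity effect, using synthetic data (the dataset, feature counts, and alpha below are illustrative, not the housing example above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem: 100 features, but only 10 actually drive the target
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0)  # illustrative regularization strength
lasso.fit(X, y)

# Most coefficients end up exactly zero; the survivors are the useful features
n_nonzero = np.sum(lasso.coef_ != 0)
print(f"Non-zero coefficients: {n_nonzero} out of {X.shape[1]}")
```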
2. L2 Regularization (Ridge)
Definition:
Adds the squared value of the weights as a penalty term to the loss function:

\( \text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_i w_i^2 \)
Key Characteristics:
- Encourages small weights (closer to zero but not exactly zero)
- Reduces the influence of any single feature without removing it entirely
- Works best when all features are useful
When to Use:
- You believe all features have some predictive power
- You want to avoid overfitting but keep every feature in play
- Useful for correlated features
Example:
Predicting house prices → All features (square footage, bedrooms, bathrooms, etc.) contribute, but L2 ensures no single one dominates.
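A comparable sketch for Ridge (again synthetic data with illustrative parameters): the coefficients shrink relative to plain least squares, but essentially none become exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic problem where every feature carries some signal
X, y = make_regression(n_samples=500, n_features=20, n_informative=20,
                       noise=10.0, random_state=42)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # illustrative regularization strength

print("Largest |coef| without regularization:", np.abs(ols.coef_).max())
print("Largest |coef| with Ridge:            ", np.abs(ridge.coef_).max())
print("Exactly-zero Ridge coefficients:      ", np.sum(ridge.coef_ == 0))
```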
3. Side-by-Side: L1 vs L2
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty Term | \( \lambda \sum_i \lvert w_i \rvert \) | \( \lambda \sum_i w_i^2 \) |
| Effect on Weights | Many become exactly zero | All become small, non-zero |
| Feature Selection | ✅ Yes | ❌ No |
| Optimization | Harder (non-differentiable at zero) | Easier (fully differentiable) |
| Best For | Sparse models, irrelevant features | Regularizing all features |
4. Elastic Net — The Best of Both Worlds
Elastic Net combines L1 and L2 penalties:

\( \text{Loss} = \text{Loss}_{\text{original}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2 \)
Why use it?
- Retains the feature selection benefits of L1
- Keeps the weight shrinkage benefits of L2
- Especially helpful when features are correlated
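A small sketch (synthetic correlated data, illustrative parameters) of how the l1_ratio setting in scikit-learn's ElasticNet moves the model between Ridge-like behaviour (near 0) and Lasso-like sparsity (near 1):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# effective_rank makes the features correlated, the case Elastic Net targets
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       effective_rank=5, noise=10.0, random_state=0)

for l1_ratio in (0.1, 0.5, 0.9):
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10_000).fit(X, y)
    zeros = np.sum(model.coef_ == 0)
    print(f"l1_ratio={l1_ratio}: {zeros} of {X.shape[1]} coefficients are exactly zero")
```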
5. Visual Intuition
- L1 (Lasso): Diamond-shaped constraint → the optimum often lands on a corner → many weights exactly zero (sparse solution)
- L2 (Ridge): Circular constraint → the optimum lands on the smooth boundary, away from the axes → all weights small, none exactly zero
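If you want to draw these two constraint regions yourself, here is a rough matplotlib sketch (purely illustrative) of the unit L1 diamond and L2 circle in two dimensions:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
fig, ax = plt.subplots(figsize=(5, 5))

# L2 constraint boundary: w1^2 + w2^2 = 1 (a circle, no corners)
ax.plot(np.cos(theta), np.sin(theta), label=r"L2: $w_1^2 + w_2^2 \leq 1$")

# L1 constraint boundary: |w1| + |w2| = 1 (a diamond with corners on the axes)
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])
ax.plot(diamond[:, 0], diamond[:, 1], label=r"L1: $|w_1| + |w_2| \leq 1$")

ax.axhline(0, color="grey", linewidth=0.5)
ax.axvline(0, color="grey", linewidth=0.5)
ax.set_aspect("equal")
ax.legend()
plt.show()
```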
6. Choosing the Right Regularization
✅ Use L1 when:
- You want a sparse model
- You expect many irrelevant features
- You need automatic feature selection
✅ Use L2 when:
- All features likely matter
- You want to control coefficient size without removing features
- You have multicollinearity (correlated features)
✅ Use Elastic Net when:
- You want a mix of sparsity + stability
- You have many correlated features
- You want to avoid L1’s instability on correlated data
7. Python Implementation
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# X_train and y_train are assumed to be your prepared training features and targets

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha plays the role of λ
lasso.fit(X_train, y_train)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances the L1/L2 mix
elastic_net.fit(X_train, y_train)
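In practice the regularization strength is usually tuned rather than fixed at 0.1. A sketch using scikit-learn's cross-validated variants (the synthetic data, alpha grid, and l1_ratio candidates below are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

# Synthetic stand-in for X_train, y_train
X, y = make_regression(n_samples=500, n_features=30, n_informative=10,
                       noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 30)  # candidate regularization strengths

lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

print("Best alpha (Lasso):        ", lasso_cv.alpha_)
print("Best alpha (Ridge):        ", ridge_cv.alpha_)
print("Best alpha, l1_ratio (EN): ", enet_cv.alpha_, enet_cv.l1_ratio_)
```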
8. Summary Table
| Regularization | Main Effect | Removes Features? | Best For |
|---|---|---|---|
| L1 | Sparse weights (zeros) | ✅ Yes | High-dimensional, irrelevant features |
| L2 | Small, non-zero weights | ❌ No | All features relevant, control magnitude |
| Elastic Net | Mix of L1 & L2 benefits | Partial | Correlated features + feature selection |
💡 Takeaway:
- Use L1 for feature selection
- Use L2 for controlling weight magnitude
- Use Elastic Net for a balanced approach