L1 vs L2 Regularization — The Complete Guide (with Elastic Net)
When building machine learning models, it’s easy to fall into the overfitting trap — where your model learns noise instead of real patterns.
Regularization is one of the best ways to fight this.
Two of the most widely used regularization techniques are:
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
Both add a penalty term to the loss function, discouraging overly complex models. Let’s break them down.
1. L1 Regularization (Lasso)
Definition:
Adds the absolute value of the weights as a penalty term to the loss function:

Loss = (original loss) + λ Σᵢ |wᵢ|

Where:
- wᵢ = weight of the i-th feature
- λ = regularization strength (higher = more penalty)
Key Characteristics:
- Encourages sparsity (many weights become exactly zero)
- Naturally performs feature selection
- Works best when only a subset of features is truly relevant
When to Use:
- High-dimensional datasets (e.g., text classification, genetics)
- When you expect many features to be irrelevant
Example:
Predicting house prices with 100 features → L1 might keep only the 10 most important ones (e.g., square footage, location) and set the rest to zero.
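To make the sparsity concrete, here is a minimal sketch on synthetic data (the dataset, feature counts, and alpha value are illustrative assumptions, not part of the house-price example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 features, only 10 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha = λ; larger values push more weights to exactly zero
lasso.fit(X, y)

kept = np.sum(lasso.coef_ != 0)
print(f"Non-zero coefficients: {kept} of {X.shape[1]}")  # typically close to the 10 informative features
```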
2. L2 Regularization (Ridge)
Definition:
Adds the squared value of the weights as a penalty term to the loss function:

Loss = (original loss) + λ Σᵢ wᵢ²
Key Characteristics:
- Encourages small weights (closer to zero but not exactly zero)
- Reduces the influence of any single feature without removing it entirely
- Works best when all features are useful
When to Use:
- You believe all features have some predictive power
- You want to avoid overfitting but keep every feature in play
- Useful for correlated features
Example:
Predicting house prices → All features (square footage, bedrooms, bathrooms, etc.) contribute, but L2 ensures no single one dominates.
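A rough sketch of the shrinkage effect, again on synthetic data with illustrative parameter values, compares coefficient magnitudes against plain least squares:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # alpha = λ; larger values shrink weights harder

print("Largest |coef|, OLS  :", np.abs(ols.coef_).max())
print("Largest |coef|, Ridge:", np.abs(ridge.coef_).max())  # smaller than OLS
print("Coefficients set to 0:", np.sum(ridge.coef_ == 0))   # typically 0 — nothing is dropped
```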
3. Side-by-Side: L1 vs L2
| Aspect | L1 (Lasso) | L2 (Ridge) | 
|---|---|---|
| Penalty Term | λ Σ \|wᵢ\| | λ Σ wᵢ² | 
| Effect on Weights | Many become exactly zero | All become small, non-zero | 
| Feature Selection | ✅ Yes | ❌ No | 
| Optimization | Harder (non-differentiable at zero) | Easier (fully differentiable) | 
| Best For | Sparse models, irrelevant features | Regularizing all features | 
4. Elastic Net — The Best of Both Worlds
Elastic Net combines the L1 and L2 penalties:

Loss = (original loss) + λ₁ Σᵢ |wᵢ| + λ₂ Σᵢ wᵢ²
Why use it?
- Retains the feature selection benefits of L1
- Keeps the weight shrinkage benefits of L2
- Especially helpful when features are correlated
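In scikit-learn the two penalties are blended by a single l1_ratio parameter (1.0 = pure Lasso, 0.0 = pure Ridge). A small sketch on synthetic data (parameter values are illustrative) shows sparsity growing as the mix leans toward L1:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

for l1_ratio in (0.1, 0.5, 0.9):
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10_000).fit(X, y)
    zeros = np.sum(model.coef_ == 0)
    print(f"l1_ratio={l1_ratio}: {zeros} of {X.shape[1]} coefficients are exactly zero")
```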
5. Visual Intuition
- L1 (Lasso): Diamond-shaped constraint region → the optimum often lands on a corner, which lies on an axis → many weights are exactly zero (sparse solution)
- L2 (Ridge): Circular constraint region → no corners, so the optimum rarely lies exactly on an axis → all weights shrink, but none are forced to zero
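The constraint regions come from the standard equivalent constrained view of the penalized objectives (a textbook reformulation, included here for reference):

minimize Loss(w)  subject to  Σᵢ |wᵢ| ≤ t    (L1 → diamond-shaped region)
minimize Loss(w)  subject to  Σᵢ wᵢ² ≤ t    (L2 → circular region)

For every penalty strength λ there is a corresponding budget t, so the penalized and constrained forms describe the same solutions.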
6. Choosing the Right Regularization
✅ Use L1 when:
- You want a sparse model
- You expect many irrelevant features
- You need automatic feature selection
✅ Use L2 when:
- All features likely matter
- You want to control coefficient size without removing features
- You have multicollinearity (correlated features)
✅ Use Elastic Net when:
- You want a mix of sparsity + stability
- You have many correlated features
- You want to avoid L1’s instability on correlated data
7. Python Implementation
```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# X_train, y_train: your training features and targets

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha = λ
lasso.fit(X_train, y_train)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances L1/L2
elastic_net.fit(X_train, y_train)
```
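In practice, alpha (and l1_ratio) are usually chosen by cross-validation rather than hard-coded. A minimal sketch using scikit-learn's built-in CV estimators, reusing the same X_train and y_train as above (the candidate grids are illustrative):

```python
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

# Each estimator fits over a grid of penalty strengths and keeps the best one
lasso_cv = LassoCV(cv=5).fit(X_train, y_train)  # chooses alpha automatically
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
enet_cv = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_train, y_train)

print("Lasso alpha:", lasso_cv.alpha_)
print("Ridge alpha:", ridge_cv.alpha_)
print("Elastic Net alpha / l1_ratio:", enet_cv.alpha_, enet_cv.l1_ratio_)
```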
8. Summary Table
| Regularization | Main Effect | Removes Features? | Best For | 
|---|---|---|---|
| L1 | Sparse weights (zeros) | ✅ Yes | High-dimensional, irrelevant features | 
| L2 | Small, non-zero weights | ❌ No | All features relevant, control magnitude | 
| Elastic Net | Mix of L1 & L2 benefits | Partial | Correlated features + feature selection | 
💡 Takeaway:
- Use L1 for feature selection
- Use L2 for controlling weight magnitude
- Use Elastic Net for a balanced approach
 