Saturday, August 9, 2025

Regularization

L1 vs L2 Regularization — The Complete Guide (with Elastic Net)

When building machine learning models, it’s easy to fall into the overfitting trap — where your model learns noise instead of real patterns.
Regularization is one of the best ways to fight this.

Two of the most widely used regularization techniques are:

  • L1 Regularization (Lasso)

  • L2 Regularization (Ridge)

Both add a penalty term to the loss function, discouraging overly complex models. Let’s break them down.


1. L1 Regularization (Lasso)

Definition:
Adds the absolute value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} |w_i|

Where:

  • w_i = weight of the i-th feature

  • \lambda = regularization strength (higher = stronger penalty)

Key Characteristics:

  • Encourages sparsity (many weights become exactly zero)

  • Naturally performs feature selection

  • Works best when only a subset of features is truly relevant

When to Use:

  • High-dimensional datasets (e.g., text classification, genetics)

  • When you expect many features to be irrelevant

Example:
Predicting house prices with 100 features → L1 might keep only the 10 most important ones (e.g., square footage, location) and set the rest to zero.
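
To see this sparsity effect concretely, here is a minimal sketch on synthetic data; the dataset and the alpha value are illustrative, not from the example above:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 100 features, only 10 of which actually influence the target
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha plays the role of λ
lasso.fit(X, y)

# Lasso should zero out most of the 90 uninformative features
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of 100")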


2. L2 Regularization (Ridge)

Definition:
Adds the squared value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} w_i^2

Key Characteristics:

  • Encourages small weights (closer to zero but not exactly zero)

  • Reduces the influence of any single feature without removing it entirely

  • Works best when all features are useful

When to Use:

  • You believe all features have some predictive power

  • You want to avoid overfitting but keep every feature in play

  • Useful for correlated features

Example:
Predicting house prices → All features (square footage, bedrooms, bathrooms, etc.) contribute, but L2 ensures no single one dominates.
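
A quick sketch of the shrinkage behaviour, comparing plain linear regression with Ridge on the same synthetic data (all values illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha plays the role of λ

# Ridge pulls every coefficient toward zero but sets none exactly to zero
print("Largest coefficient, OLS:  ", np.abs(ols.coef_).max())
print("Largest coefficient, Ridge:", np.abs(ridge.coef_).max())
print("Exact zeros under Ridge:   ", np.sum(ridge.coef_ == 0))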


3. Side-by-Side: L1 vs L2

| Aspect | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty Term | λ × sum of absolute weights | λ × sum of squared weights |
| Effect on Weights | Many become exactly zero | All become small, non-zero |
| Feature Selection | ✅ Yes | ❌ No |
| Optimization | Harder (non-differentiable at zero) | Easier (fully differentiable) |
| Best For | Sparse models, irrelevant features | Regularizing all features |

4. Elastic Net — The Best of Both Worlds

Elastic Net combines L1 and L2 penalties:

Loss = Original\_Loss + \alpha \lambda \sum_{i} |w_i| + (1 - \alpha) \lambda \sum_{i} w_i^2

Why use it?

  • Retains the feature selection benefits of L1

  • Keeps the weight shrinkage benefits of L2

  • Especially helpful when features are correlated
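
Note that scikit-learn parameterizes the same idea slightly differently: in ElasticNet, alpha sets the overall penalty strength and l1_ratio sets the share given to the L1 term. A minimal sketch on data with deliberately correlated columns (dataset and parameter values are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Base data plus near-duplicate (highly correlated) copies of the first 5 columns
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, X[:, :5] + rng.normal(scale=0.01, size=(300, 5))])

# alpha ~ overall strength, l1_ratio ~ share of the penalty that is L1
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print("Non-zero coefficients:", np.sum(enet.coef_ != 0), "of", X.shape[1])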


5. Visual Intuition

  • L1 (Lasso): Diamond-shaped constraint → optimization often lands on corners → many weights exactly zero (sparse solution)

  • L2 (Ridge): Circular constraint → the optimum typically lands on the smooth boundary, away from the axes → all weights shrink, but none become exactly zero
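
If you want to draw the two constraint regions yourself, here is a small matplotlib sketch (purely illustrative):

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# L1 ball |w1| + |w2| <= 1: a diamond whose corners sit on the axes
diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]])
axes[0].plot(diamond[:, 0], diamond[:, 1])
axes[0].set_title("L1 constraint (diamond)")

# L2 ball w1^2 + w2^2 <= 1: a circle with no corners
axes[1].plot(np.cos(theta), np.sin(theta))
axes[1].set_title("L2 constraint (circle)")

for ax in axes:
    ax.set_aspect("equal")
    ax.axhline(0, linewidth=0.5)
    ax.axvline(0, linewidth=0.5)
plt.show()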


6. Choosing the Right Regularization

Use L1 when:

  • You want a sparse model

  • You expect many irrelevant features

  • You need automatic feature selection

Use L2 when:

  • All features likely matter

  • You want to control coefficient size without removing features

  • You have multicollinearity (correlated features)

Use Elastic Net when:

  • You want a mix of sparsity + stability

  • You have many correlated features

  • You want to avoid L1’s instability on correlated data
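
Whichever penalty you pick, the regularization strength itself is usually tuned by cross-validation. A minimal sketch with scikit-learn's built-in CV estimators (the data and parameter grids are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

lasso_cv = LassoCV(cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
enet_cv = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

print("Lasso best alpha:     ", lasso_cv.alpha_)
print("Ridge best alpha:     ", ridge_cv.alpha_)
print("ElasticNet best alpha:", enet_cv.alpha_, "l1_ratio:", enet_cv.l1_ratio_)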


7. Python Implementation

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split

# Example training data (substitute your own feature matrix X and target y)
X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha plays the role of λ
lasso.fit(X_train, y_train)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances L1 vs. L2
elastic_net.fit(X_train, y_train)
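
After fitting, you can inspect the learned coefficients to confirm the behaviour described above (this continues from the snippet, where lasso, ridge, and elastic_net are already fitted):

import numpy as np

# Sparsity check: Lasso typically produces exact zeros, Ridge does not
print("Lasso zero coefficients:      ", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:      ", np.sum(ridge.coef_ == 0))
print("Elastic Net zero coefficients:", np.sum(elastic_net.coef_ == 0))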

8. Summary Table

| Regularization | Main Effect | Removes Features? | Best For |
| --- | --- | --- | --- |
| L1 | Sparse weights (zeros) | ✅ Yes | High-dimensional data, irrelevant features |
| L2 | Small, non-zero weights | ❌ No | All features relevant, control magnitude |
| Elastic Net | Mix of L1 & L2 benefits | Partial | Correlated features + feature selection |

💡 Takeaway:

  • Use L1 for feature selection

  • Use L2 for controlling weight magnitude

  • Use Elastic Net for a balanced approach


