L1 vs L2 Regularization — A Simple Hands-On Guide
When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.
In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:
- Model weights
- Feature selection
- Performance
The Plan
We’ll explore:
- Dataset creation and setup
- Linear Regression without regularization
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Side-by-side comparison
- Key takeaways + Python code
Step 1: Our Toy Dataset
We’ll make a small synthetic dataset with some useful features and some noise.
Features:
- X1: Strong correlation with target (important)
- X2: Weak correlation (partially relevant)
- X3, X4: Noise features (irrelevant)
Target (Y): A linear combination of X1 and X2 plus a little noise.
| X1 | X2 | X3 | X4 | Y | 
|---|---|---|---|---|
| 1.0 | 2.0 | -0.5 | 1.2 | 4.5 | 
| 2.0 | 0.8 | 3.0 | 0.5 | 5.0 | 
| 1.5 | 1.5 | -1.0 | 2.3 | 5.5 | 
| 2.2 | 1.0 | 0.2 | 3.1 | 6.8 | 
| 3.0 | 2.5 | -1.5 | 1.5 | 9.0 | 
Step 2: Linear Regression (No Regularization)
A plain linear regression model minimizes the Mean Squared Error (MSE):
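In symbols, with n samples, actual values $y_i$, and predictions $\hat{y}_i$:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$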
What Happens
- It fits weights to all features.
- Even irrelevant ones get non-zero weights (overfitting risk).
| Feature | Weight | 
|---|---|
| X1 | 2.5 | 
| X2 | 1.3 | 
| X3 | 0.8 | 
| X4 | 0.6 | 
✅ Observation: Noise features (X3, X4) are influencing predictions.
Step 3: L1 Regularization (Lasso)
L1 adds a penalty on the absolute value of weights:
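$$\text{Loss} = \text{MSE} + \lambda \sum_{j} |w_j|$$

Here $w_j$ are the model weights and $\lambda$ controls how strongly large weights are penalized (this is the role played by alpha in the scikit-learn code below, up to scaling conventions).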
Impact
- Encourages sparsity: some weights become exactly zero.
- Effectively performs feature selection.
| Feature | Weight | 
|---|---|
| X1 | 2.4 | 
| X2 | 1.2 | 
| X3 | 0.0 | 
| X4 | 0.0 | 
✅ Observation: Irrelevant features are dropped completely.
Step 4: L2 Regularization (Ridge)
L2 adds a penalty on the squared value of weights:
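$$\text{Loss} = \text{MSE} + \lambda \sum_{j} w_j^2$$

Again, $\lambda$ sets the penalty strength; larger values shrink the weights more aggressively.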
Impact
- Shrinks weights towards zero, but never fully removes them.
- Reduces the influence of less important features.
| Feature | Weight | 
|---|---|
| X1 | 2.2 | 
| X2 | 1.1 | 
| X3 | 0.3 | 
| X4 | 0.2 | 
✅ Observation: All features remain, but noise features have smaller weights.
Step 5: Side-by-Side Comparison
| Aspect | No Reg. | L1 (Lasso) | L2 (Ridge) | 
|---|---|---|---|
| X1 Weight | 2.5 | 2.4 | 2.2 | 
| X2 Weight | 1.3 | 1.2 | 1.1 | 
| X3 Weight | 0.8 | 0.0 | 0.3 | 
| X4 Weight | 0.6 | 0.0 | 0.2 | 
| Overfitting Risk | High | Low | Low | 
| Feature Selection | No | Yes | No | 
Step 6: Takeaways
- No Regularization: Risks overfitting; all features get weights.
- L1 (Lasso): Best when you want feature selection; creates sparse models.
- L2 (Ridge): Best when all features matter but their influence needs to be kept in check.
Python Example
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Toy dataset from Step 1
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-test split (with only 5 samples this is purely illustrative;
# random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the three models (alpha is the regularization strength)
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X_train, y_train)   # L2 penalty

# Compare the learned weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)
💡 In short:
- Use Lasso if you want to automatically drop irrelevant features.
- Use Ridge if you want to keep all features but control their influence.
- Try Elastic Net (L1 + L2) if you want the best of both worlds (see the sketch below).
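A minimal Elastic Net sketch, using scikit-learn's ElasticNet on the same toy data; alpha=0.1 and l1_ratio=0.5 are illustrative values, not tuned:

from sklearn.linear_model import ElasticNet
import numpy as np
# Same toy dataset as in the Python example above
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])
# l1_ratio blends the two penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("ElasticNet:", enet.coef_)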
 