L1 vs L2 Regularization — A Simple Hands-On Guide
When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.
In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:
- Model weights
- Feature selection
- Performance
The Plan
We’ll explore:
- Dataset creation and setup
- Linear Regression without regularization
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Side-by-side comparison
- Key takeaways + Python code
Step 1: Our Toy Dataset
We’ll make a small synthetic dataset with some useful features and some noise.
Features:
- X1: Strong correlation with target (important)
- X2: Weak correlation (partially relevant)
- X3, X4: Noise features (irrelevant)
Target (Y): A linear combination of X1 and X2 plus a little noise.
| X1 | X2 | X3 | X4 | Y |
|---|---|---|---|---|
| 1.0 | 2.0 | -0.5 | 1.2 | 4.5 |
| 2.0 | 0.8 | 3.0 | 0.5 | 5.0 |
| 1.5 | 1.5 | -1.0 | 2.3 | 5.5 |
| 2.2 | 1.0 | 0.2 | 3.1 | 6.8 |
| 3.0 | 2.5 | -1.5 | 1.5 | 9.0 |
Step 2: Linear Regression (No Regularization)
A plain linear regression model minimizes the Mean Squared Error (MSE):
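MSE = (1/n) Σ (yᵢ − ŷᵢ)²

where yᵢ is the true target, ŷᵢ = w·xᵢ + b is the model's prediction, and n is the number of samples.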
What Happens
- It fits weights to all features.
- Even irrelevant ones get non-zero weights (overfitting risk).
| Feature | Weight |
|---|---|
| X1 | 2.5 |
| X2 | 1.3 |
| X3 | 0.8 |
| X4 | 0.6 |
✅ Observation: Noise features (X3, X4) are influencing predictions.
Step 3: L1 Regularization (Lasso)
L1 adds a penalty on the absolute value of weights:
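Loss = MSE + λ Σ |wⱼ|

Here λ controls the strength of the penalty (it plays the role of the alpha parameter in the scikit-learn code below); larger values push more weights to exactly zero.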
Impact
- Encourages sparsity: some weights become exactly zero.
- Effectively performs feature selection.
| Feature | Weight |
|---|---|
| X1 | 2.4 |
| X2 | 1.2 |
| X3 | 0.0 |
| X4 | 0.0 |
✅ Observation: Irrelevant features are dropped completely.
Step 4: L2 Regularization (Ridge)
L2 adds a penalty on the squared value of weights:
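Loss = MSE + λ Σ wⱼ²

Because the penalty grows with the square of each weight, large weights are shrunk aggressively, but weights approach zero without ever being set exactly to zero.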
Impact
- Shrinks weights towards zero, but never fully removes them.
- Reduces the influence of less important features.
| Feature | Weight |
|---|---|
| X1 | 2.2 |
| X2 | 1.1 |
| X3 | 0.3 |
| X4 | 0.2 |
✅ Observation: All features remain, but noise features have smaller weights.
Step 5: Side-by-Side Comparison
| Aspect | No Reg. | L1 (Lasso) | L2 (Ridge) |
|---|---|---|---|
| X1 Weight | 2.5 | 2.4 | 2.2 |
| X2 Weight | 1.3 | 1.2 | 1.1 |
| X3 Weight | 0.8 | 0.0 | 0.3 |
| X4 Weight | 0.6 | 0.0 | 0.2 |
| Overfitting Risk | High | Low | Low |
| Feature Selection | No | Yes | No |
Step 6: Takeaways
- No Regularization: Risks overfitting; all features get weights.
- L1 (Lasso): Best when you want feature selection; creates sparse models.
- L2 (Ridge): Best when all features matter but their influence needs to be kept in check.
Python Example
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Data: the toy dataset from Step 1
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8,  3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0,  0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-test split (random_state fixed so the output is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Output weights (coefficients)
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)
💡 In short:
- Use Lasso if you want to automatically drop irrelevant features.
- Use Ridge if you want to keep all features but control their influence.
- Try Elastic Net (L1 + L2) if you want the best of both worlds; a short sketch follows below.
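For completeness, here is a minimal Elastic Net sketch that reuses X_train and y_train from the example above. The alpha=0.1 and l1_ratio=0.5 settings are illustrative values, not tuned choices.

from sklearn.linear_model import ElasticNet

# Elastic Net mixes both penalties: l1_ratio=1.0 is pure Lasso, 0.0 is pure Ridge.
# alpha and l1_ratio here are example settings, not tuned values.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
print("ElasticNet:", enet.coef_)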