L1 vs L2 Regularization — A Simple Hands-On Guide
When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.
In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:
- Model weights
- Feature selection
- Performance
The Plan
We’ll explore:
- Dataset creation and setup
- Linear Regression without regularization
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Side-by-side comparison
- Key takeaways + Python code
Step 1: Our Toy Dataset
We’ll make a small synthetic dataset with some useful features and some noise.
Features:
- X1: Strong correlation with target (important)
- X2: Weak correlation (partially relevant)
- X3, X4: Noise features (irrelevant)
Target (Y): A linear combination of X1 and X2 plus a little noise.
| X1  | X2  | X3   | X4  | Y   |
|-----|-----|------|-----|-----|
| 1.0 | 2.0 | -0.5 | 1.2 | 4.5 |
| 2.0 | 0.8 | 3.0  | 0.5 | 5.0 |
| 1.5 | 1.5 | -1.0 | 2.3 | 5.5 |
| 2.2 | 1.0 | 0.2  | 3.1 | 6.8 |
| 3.0 | 2.5 | -1.5 | 1.5 | 9.0 |
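If you'd like to follow along, here is one way to set up the same table in code. The pandas DataFrame below is just for display; the full modelling example at the end of the post uses plain NumPy arrays with the same values.

import pandas as pd

# Same toy data as the table above: X1 matters most, X2 a little, X3/X4 are noise
data = pd.DataFrame({
    "X1": [1.0, 2.0, 1.5, 2.2, 3.0],
    "X2": [2.0, 0.8, 1.5, 1.0, 2.5],
    "X3": [-0.5, 3.0, -1.0, 0.2, -1.5],
    "X4": [1.2, 0.5, 2.3, 3.1, 1.5],
    "Y":  [4.5, 5.0, 5.5, 6.8, 9.0],
})
print(data)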
Step 2: Linear Regression (No Regularization)
A plain linear regression model minimizes the Mean Squared Error (MSE):
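MSE = (1/n) * Σ (yᵢ − ŷᵢ)²

where yᵢ is the true target, ŷᵢ is the model's prediction, and n is the number of samples. There is no penalty on the weights, so nothing discourages the model from using every feature it can.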
What Happens
| Feature | Weight |
|---------|--------|
| X1      | 2.5    |
| X2      | 1.3    |
| X3      | 0.8    |
| X4      | 0.6    |
✅ Observation: Noise features (X3, X4) are influencing predictions.
Step 3: L1 Regularization (Lasso)
L1 adds a penalty on the absolute value of weights:
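Loss = MSE + λ * Σ |wⱼ|

where wⱼ are the model weights and λ (called alpha in scikit-learn) controls the penalty strength. Because the penalty grows linearly with each weight, the cheapest way to reduce it is often to push small weights all the way to zero.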
Impact
| Feature | Weight |
|---------|--------|
| X1      | 2.4    |
| X2      | 1.2    |
| X3      | 0.0    |
| X4      | 0.0    |
✅ Observation: Irrelevant features are dropped completely.
Step 4: L2 Regularization (Ridge)
L2 adds a penalty on the squared value of weights:
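Loss = MSE + λ * Σ wⱼ²

where wⱼ are the model weights and λ (alpha in scikit-learn) sets the penalty strength. Squaring the weights punishes large values heavily but barely touches small ones, so weights shrink smoothly instead of snapping to zero.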
Impact
- Shrinks weights towards zero, but never fully removes them.
- Reduces the influence of less important features.

| Feature | Weight |
|---------|--------|
| X1      | 2.2    |
| X2      | 1.1    |
| X3      | 0.3    |
| X4      | 0.2    |
✅ Observation: All features remain, but noise features have smaller weights.
Step 5: Side-by-Side Comparison
| Aspect            | No Reg. | L1 (Lasso) | L2 (Ridge) |
|-------------------|---------|------------|------------|
| X1 Weight         | 2.5     | 2.4        | 2.2        |
| X2 Weight         | 1.3     | 1.2        | 1.1        |
| X3 Weight         | 0.8     | 0.0        | 0.3        |
| X4 Weight         | 0.6     | 0.0        | 0.2        |
| Overfitting Risk  | High    | Low        | Low        |
| Feature Selection | No      | Yes        | No         |
Step 6: Takeaways
- No Regularization: Risks overfitting; every feature gets a weight, including the noise.
- L1 (Lasso): Best when you want feature selection; it produces sparse models.
- L2 (Ridge): Best when all features matter but their influence needs to be kept in check.
Python Example
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Data: columns are X1, X2, X3, X4
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-test split (random_state fixed so the results are reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one model of each kind; alpha controls the regularization strength
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Compare the learned weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)
💡 In short:

- Use Lasso if you want to automatically drop irrelevant features.
- Use Ridge if you want to keep all features but control their influence.
- Try Elastic Net (L1 + L2) if you want the best of both worlds.
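For that last option, here is a minimal sketch using scikit-learn's ElasticNet; alpha=0.1 and l1_ratio=0.5 are purely illustrative values, and it reuses X_train and y_train from the example above.

from sklearn.linear_model import ElasticNet

# Elastic Net mixes both penalties: l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
print("ElasticNet:", enet.coef_)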