L1 vs L2 Regularization — A Simple Hands-On Guide
When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.
In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:
- Model weights
- Feature selection
- Performance
The Plan
We’ll explore:
- Dataset creation and setup
- Linear Regression without regularization
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Side-by-side comparison
- Key takeaways + Python code
Step 1: Our Toy Dataset
We’ll make a small synthetic dataset with some useful features and some noise.
Features:
- X1: Strong correlation with target (important)
- X2: Weak correlation (partially relevant)
- X3, X4: Noise features (irrelevant)
Target (Y): A linear combination of X1 and X2 plus a little noise.
| X1 | X2 | X3 | X4 | Y | 
|---|---|---|---|---|
| 1.0 | 2.0 | -0.5 | 1.2 | 4.5 | 
| 2.0 | 0.8 | 3.0 | 0.5 | 5.0 | 
| 1.5 | 1.5 | -1.0 | 2.3 | 5.5 | 
| 2.2 | 1.0 | 0.2 | 3.1 | 6.8 | 
| 3.0 | 2.5 | -1.5 | 1.5 | 9.0 | 
Step 2: Linear Regression (No Regularization)
A plain linear regression model minimizes the Mean Squared Error (MSE):
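In symbols, with n samples, actual values $y_i$, and predictions $\hat{y}_i$:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$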
What Happens
- It fits weights to all features.
- Even irrelevant ones get non-zero weights (overfitting risk).
| Feature | Weight | 
|---|---|
| X1 | 2.5 | 
| X2 | 1.3 | 
| X3 | 0.8 | 
| X4 | 0.6 | 
✅ Observation: Noise features (X3, X4) are influencing predictions.
Step 3: L1 Regularization (Lasso)
L1 adds a penalty on the absolute value of weights:
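$$\text{Loss} = \text{MSE} + \lambda \sum_{j} |w_j|$$

Here $w_j$ are the model weights and $\lambda$ controls how strongly large weights are penalized (this is the role played by alpha in the scikit-learn code below, up to scaling conventions).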
Impact
- Encourages sparsity: some weights become exactly zero.
- Effectively performs feature selection.
| Feature | Weight | 
|---|---|
| X1 | 2.4 | 
| X2 | 1.2 | 
| X3 | 0.0 | 
| X4 | 0.0 | 
✅ Observation: Irrelevant features are dropped completely.
Step 4: L2 Regularization (Ridge)
L2 adds a penalty on the squared value of weights:
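$$\text{Loss} = \text{MSE} + \lambda \sum_{j} w_j^2$$

Again, $\lambda$ sets the penalty strength; larger values shrink the weights more aggressively.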
Impact
- Shrinks weights towards zero, but never fully removes them.
- Reduces the influence of less important features.
| Feature | Weight | 
|---|---|
| X1 | 2.2 | 
| X2 | 1.1 | 
| X3 | 0.3 | 
| X4 | 0.2 | 
✅ Observation: All features remain, but noise features have smaller weights.
Step 5: Side-by-Side Comparison
| Aspect | No Reg. | L1 (Lasso) | L2 (Ridge) | 
|---|---|---|---|
| X1 Weight | 2.5 | 2.4 | 2.2 | 
| X2 Weight | 1.3 | 1.2 | 1.1 | 
| X3 Weight | 0.8 | 0.0 | 0.3 | 
| X4 Weight | 0.6 | 0.0 | 0.2 | 
| Overfitting Risk | High | Low | Low | 
| Feature Selection | No | Yes | No | 
Step 6: Takeaways
- No Regularization: Risks overfitting; all features get weights.
- L1 (Lasso): Best when you want feature selection; creates sparse models.
- L2 (Ridge): Best when all features matter but their influence needs to be kept in check.
Python Example
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Toy dataset from Step 1
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-test split (with only 5 samples this is purely illustrative;
# random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the three models (alpha is the regularization strength)
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X_train, y_train)   # L2 penalty

# Compare the learned weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)
💡 In short:
- Use Lasso if you want to automatically drop irrelevant features.
- Use Ridge if you want to keep all features but control their influence.
- Try Elastic Net (L1 + L2) if you want the best of both worlds (see the sketch below).
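A minimal Elastic Net sketch, using scikit-learn's ElasticNet on the same toy data; alpha=0.1 and l1_ratio=0.5 are illustrative values, not tuned:

from sklearn.linear_model import ElasticNet
import numpy as np
# Same toy dataset as in the Python example above
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])
# l1_ratio blends the two penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("ElasticNet:", enet.coef_)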
 