Saturday, August 9, 2025

Explaining L1 and L2 Regularization with a Hands-On Example



L1 vs L2 Regularization — A Simple Hands-On Guide

When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.

In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:

  • Model weights

  • Feature selection

  • Performance


The Plan

We’ll explore:

  1. Dataset creation and setup

  2. Linear Regression without regularization

  3. L1 Regularization (Lasso)

  4. L2 Regularization (Ridge)

  5. Side-by-side comparison

  6. Key takeaways + Python code


Step 1: Our Toy Dataset

We’ll make a small synthetic dataset with some useful features and some noise.

Features:

  • X1: Strong correlation with target (important)

  • X2: Weak correlation (partially relevant)

  • X3, X4: Noise features (irrelevant)

Target (Y): A linear combination of X1 and X2 plus a little noise.

X1     X2     X3     X4     Y
1.0    2.0   -0.5    1.2    4.5
2.0    0.8    3.0    0.5    5.0
1.5    1.5   -1.0    2.3    5.5
2.2    1.0    0.2    3.1    6.8
3.0    2.5   -1.5    1.5    9.0
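
If you want to play with a larger version of this setup, here is a minimal sketch of how such data could be generated (the sample size, coefficients, and noise level below are illustrative assumptions, not the exact values in the table above):

import numpy as np

# Reproducible random generator (seed chosen arbitrarily)
rng = np.random.default_rng(0)
n = 200

X1 = rng.normal(size=n)   # strong driver of the target
X2 = rng.normal(size=n)   # weak driver of the target
X3 = rng.normal(size=n)   # pure noise
X4 = rng.normal(size=n)   # pure noise

# Y = 2*X1 + 0.5*X2 + small Gaussian noise (illustrative coefficients)
Y = 2.0 * X1 + 0.5 * X2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([X1, X2, X3, X4])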

Step 2: Linear Regression (No Regularization)

A plain linear regression model minimizes the Mean Squared Error (MSE):

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2

What Happens

  • It fits weights to all features.

  • Even irrelevant ones get non-zero weights (overfitting risk).

Feature   Weight
X1        2.5
X2        1.3
X3        0.8
X4        0.6

Observation: Noise features (X3, X4) are influencing predictions.
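
To make the loss concrete, here is a small sketch that fits a plain linear model on the Step 1 table and computes the MSE by hand (on such a tiny dataset the fitted weights will not match the illustrative numbers above exactly):

import numpy as np
from sklearn.linear_model import LinearRegression

# The toy dataset from Step 1
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# MSE = (1/n) * sum((Y - Y_hat)^2)
mse = np.mean((y - y_hat) ** 2)
print("Weights:", model.coef_)
print("MSE    :", mse)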


Step 3: L1 Regularization (Lasso)

L1 adds a penalty on the absolute value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum |w_i|

Impact

  • Encourages sparsity: some weights become exactly zero.

  • Effectively performs feature selection.

Feature   Weight
X1        2.4
X2        1.2
X3        0.0
X4        0.0

Observation: Irrelevant features are dropped completely.
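
One way to see the sparsity effect yourself is to sweep the penalty strength (lambda, called alpha in scikit-learn) and watch coefficients hit exactly zero; the synthetic data and alpha values below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data in the spirit of Step 1: X1 strong, X2 weak, X3/X4 pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

for alpha in [0.01, 0.1, 0.5]:   # arbitrary penalty strengths
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficients =", np.round(lasso.coef_, 3))

Larger alpha values typically zero out the noise features first, and eventually the weak X2 as well.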


Step 4: L2 Regularization (Ridge)

L2 adds a penalty on the squared value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum w_i^2

Impact

  • Shrinks weights towards zero, but never fully removes them.

  • Reduces the influence of less important features.

Feature   Weight
X1        2.2
X2        1.1
X3        0.3
X4        0.2

Observation: All features remain, but noise features have smaller weights.
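
For a quick contrast with Lasso, sweep alpha for Ridge on the same kind of data and notice that the coefficients shrink smoothly but, in general, never land exactly on zero (the data and alpha values below are again illustrative):

import numpy as np
from sklearn.linear_model import Ridge

# Same style of synthetic data as in the Lasso sketch
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

for alpha in [0.1, 10.0, 100.0]:   # arbitrary penalty strengths
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficients =", np.round(ridge.coef_, 3))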


Step 5: Side-by-Side Comparison

Aspect              No Reg.   L1 (Lasso)   L2 (Ridge)
X1 Weight           2.5       2.4          2.2
X2 Weight           1.3       1.2          1.1
X3 Weight           0.8       0.0          0.3
X4 Weight           0.6       0.0          0.2
Overfitting Risk    High      Low          Low
Feature Selection   No        Yes          No

Step 6: Takeaways

  • No Regularization: Risks overfitting; all features get weights.

  • L1 (Lasso): Best when you want feature selection; creates sparse models.

  • L2 (Ridge): Best when all features matter but their influence needs to be kept in check.


Python Example

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Data
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-test split (random_state fixed so results are reproducible on this tiny dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Output Weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)

💡 In short:

  • Use Lasso if you want to automatically drop irrelevant features.

  • Use Ridge if you want to keep all features but control their influence.

  • Try Elastic Net (L1 + L2) if you want the best of both worlds.
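
For completeness, here is a minimal Elastic Net sketch on the same toy arrays; alpha=0.1 and l1_ratio=0.5 are just example settings, not tuned values:

import numpy as np
from sklearn.linear_model import ElasticNet

X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# l1_ratio mixes the two penalties: 1.0 = pure Lasso, 0.0 = pure Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net:", enet.coef_)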


