
Saturday, August 9, 2025

Explaining L1 and L2 Regularization with a Hands-On Example



L1 vs L2 Regularization — A Simple Hands-On Guide

When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.

In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:

  • Model weights

  • Feature selection

  • Performance


The Plan

We’ll explore:

  1. Dataset creation and setup

  2. Linear Regression without regularization

  3. L1 Regularization (Lasso)

  4. L2 Regularization (Ridge)

  5. Side-by-side comparison

  6. Key takeaways + Python code


Step 1: Our Toy Dataset

We’ll make a small synthetic dataset with some useful features and some noise.

Features:

  • X1: Strong correlation with target (important)

  • X2: Weak correlation (partially relevant)

  • X3, X4: Noise features (irrelevant)

Target (Y): A linear combination of X1 and X2 plus a little noise.

X1 X2 X3 X4 Y
1.0 2.0 -0.5 1.2 4.5
2.0 0.8 3.0 0.5 5.0
1.5 1.5 -1.0 2.3 5.5
2.2 1.0 0.2 3.1 6.8
3.0 2.5 -1.5 1.5 9.0

Step 2: Linear Regression (No Regularization)

A plain linear regression model minimizes the Mean Squared Error (MSE):

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2
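
To make the loss concrete, here is a minimal NumPy sketch of the MSE calculation. The y_hat values are hypothetical predictions, just to illustrate the arithmetic:

import numpy as np

# Targets from the toy dataset and hypothetical predictions (illustrative only)
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])
y_hat = np.array([4.3, 5.2, 5.6, 6.5, 8.8])

# MSE: average of the squared residuals
mse = np.mean((y - y_hat) ** 2)
print("MSE:", mse)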

What Happens

  • It fits weights to all features.

  • Even irrelevant ones get non-zero weights (overfitting risk).

Feature Weight
X1 2.5
X2 1.3
X3 0.8
X4 0.6

Observation: Noise features (X3, X4) are influencing predictions.


Step 3: L1 Regularization (Lasso)

L1 adds a penalty on the absolute value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum |w_i|
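
Continuing the NumPy sketch above, the L1 term simply adds λ times the sum of absolute weights. The weights w and strength lam below are illustrative, not fitted values:

# Illustrative weights and regularization strength (not fitted values)
w = np.array([2.5, 1.3, 0.8, 0.6])
lam = 0.1

# MSE plus the L1 penalty λ Σ|w_i|
l1_loss = mse + lam * np.sum(np.abs(w))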

Impact

  • Encourages sparsity: some weights become exactly zero.

  • Effectively performs feature selection.

Feature Weight
X1 2.4
X2 1.2
X3 0.0
X4 0.0

Observation: Irrelevant features are dropped completely.


Step 4: L2 Regularization (Ridge)

L2 adds a penalty on the squared value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum w_i^2
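
The L2 version is the same idea with squared weights (reusing the illustrative mse, w, and lam from the sketches above):

# MSE plus the L2 penalty λ Σ w_i²
l2_loss = mse + lam * np.sum(w ** 2)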

Impact

  • Shrinks weights towards zero, but never fully removes them.

  • Reduces the influence of less important features.

Feature Weight
X1 2.2
X2 1.1
X3 0.3
X4 0.2

Observation: All features remain, but noise features have smaller weights.


Step 5: Side-by-Side Comparison

Aspect No Reg. L1 (Lasso) L2 (Ridge)
X1 Weight 2.5 2.4 2.2
X2 Weight 1.3 1.2 1.1
X3 Weight 0.8 0.0 0.3
X4 Weight 0.6 0.0 0.2
Overfitting Risk High Low Low
Feature Selection No Yes No

Step 6: Takeaways

  • No Regularization: Risks overfitting; all features get weights.

  • L1 (Lasso): Best when you want feature selection; creates sparse models.

  • L2 (Ridge): Best when all features matter but need their effects controlled.


Python Example

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Data
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-Test Split (tiny dataset, so only 1 test sample; random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Output Weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)

💡 In short:

  • Use Lasso if you want to automatically drop irrelevant features.

  • Use Ridge if you want to keep all features but control their influence.

  • Try Elastic Net (L1 + L2) if you want the best of both worlds.




L1 vs L2 Regularization — The Complete Guide (with Elastic Net)

When building machine learning models, it’s easy to fall into the overfitting trap — where your model learns noise instead of real patterns.
Regularization is one of the best ways to fight this.

Two of the most widely used regularization techniques are:

  • L1 Regularization (Lasso)

  • L2 Regularization (Ridge)

Both add a penalty term to the loss function, discouraging overly complex models. Let’s break them down.


1. L1 Regularization (Lasso)

Definition:
Adds the absolute value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} |w_i|

Where:

  • w_i = weight of the i-th feature

  • \lambda = regularization strength (higher = more penalty)

Key Characteristics:

  • Encourages sparsity (many weights become exactly zero)

  • Naturally performs feature selection

  • Works best when only a subset of features is truly relevant

When to Use:

  • High-dimensional datasets (e.g., text classification, genetics)

  • When you expect many features to be irrelevant

Example:
Predicting house prices with 100 features → L1 might keep only the 10 most important ones (e.g., square footage, location) and set the rest to zero.
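
As a rough sketch of this effect (with a synthetic dataset standing in for house prices, since we don't have the real one here), you can generate 100 features of which only 10 are informative and count how many coefficients Lasso keeps:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

# Synthetic stand-in: 100 features, only 10 actually informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))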


2. L2 Regularization (Ridge)

Definition:
Adds the squared value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} w_i^2

Key Characteristics:

  • Encourages small weights (closer to zero but not exactly zero)

  • Reduces the influence of any single feature without removing it entirely

  • Works best when all features are useful

When to Use:

  • You believe all features have some predictive power

  • You want to avoid overfitting but keep every feature in play

  • Useful for correlated features

Example:
Predicting house prices → All features (square footage, bedrooms, bathrooms, etc.) contribute, but L2 ensures no single one dominates.
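
A small hedged sketch of the same idea: with two nearly identical (highly correlated) features, ordinary least squares can hand them large offsetting weights, while Ridge keeps both weights small and stable. The data is synthetic and purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # near-duplicate of x1 (highly correlated)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print("OLS  :", LinearRegression().fit(X, y).coef_)
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)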


3. Side-by-Side: L1 vs L2

Aspect L1 (Lasso) L2 (Ridge)
Penalty Term \lambda \sum |w_i| \lambda \sum w_i^2
Effect on Weights Many become exactly zero All become small, non-zero
Feature Selection ✅ Yes ❌ No
Optimization Harder (non-differentiable at zero) Easier (fully differentiable)
Best For Sparse models, irrelevant features Regularizing all features

4. Elastic Net — The Best of Both Worlds

Elastic Net combines L1 and L2 penalties:

Loss = Original\_Loss + \alpha \lambda \sum |w_i| + (1-\alpha) \lambda \sum w_i^2

Why use it?

  • Retains the feature selection benefits of L1

  • Keeps the weight shrinkage benefits of L2

  • Especially helpful when features are correlated
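
To tie the Elastic Net formula above to code, here is a tiny NumPy sketch of just the combined penalty term. The weights w, strength lam, and mixing ratio alpha are illustrative values, not outputs of a fitted model:

import numpy as np

w = np.array([2.5, 1.3, 0.8, 0.6])  # illustrative weights
lam, alpha = 0.1, 0.5               # illustrative λ and L1/L2 mixing ratio

# α·λ·Σ|w_i| + (1-α)·λ·Σ w_i²
penalty = alpha * lam * np.sum(np.abs(w)) + (1 - alpha) * lam * np.sum(w ** 2)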


5. Visual Intuition

  • L1 (Lasso): Diamond-shaped constraint → optimization often lands on corners → many weights exactly zero (sparse solution)

  • L2 (Ridge): Circular constraint → optimization lands inside → all weights small, none zero


6. Choosing the Right Regularization

Use L1 when:

  • You want a sparse model

  • You expect many irrelevant features

  • You need automatic feature selection

Use L2 when:

  • All features likely matter

  • You want to control coefficient size without removing features

  • You have multicollinearity (correlated features)

Use Elastic Net when:

  • You want a mix of sparsity + stability

  • You have many correlated features

  • You want to avoid L1’s instability on correlated data


7. Python Implementation

from sklearn.linear_model import Lasso, Ridge, ElasticNet

# X_train, y_train are assumed to be defined (e.g., from the train/test split in the earlier example)

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha = λ (regularization strength)
lasso.fit(X_train, y_train)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances L1/L2
elastic_net.fit(X_train, y_train)
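
As in the earlier example, the fitted coefficients can then be inspected via lasso.coef_, ridge.coef_, and elastic_net.coef_ to see which weights were zeroed, shrunk, or a mix of both. Note that sklearn's l1_ratio plays roughly the role of the mixing parameter α in the Elastic Net formula above.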

8. Summary Table

Regularization Main Effect Removes Features? Best For
L1 Sparse weights (zeros) ✅ Yes High-dimensional, irrelevant features
L2 Small, non-zero weights ❌ No All features relevant, control magnitude
Elastic Net Mix of L1 & L2 benefits Partial Correlated features + feature selection

💡 Takeaway:

  • Use L1 for feature selection

  • Use L2 for controlling weight magnitude

  • Use Elastic Net for a balanced approach