
Saturday, August 9, 2025

Explaining L1 and L2 Regularization with a Hands-On Example



L1 vs L2 Regularization — A Simple Hands-On Guide

When training regression models, you might run into overfitting — where the model learns patterns from noise instead of real trends.
Two popular techniques to combat this are L1 (Lasso) and L2 (Ridge) regularization.

In this post, we’ll walk through a small dataset and see, step by step, how these methods impact:

  • Model weights

  • Feature selection

  • Performance


The Plan

We’ll explore:

  1. Dataset creation and setup

  2. Linear Regression without regularization

  3. L1 Regularization (Lasso)

  4. L2 Regularization (Ridge)

  5. Side-by-side comparison

  6. Key takeaways + Python code


Step 1: Our Toy Dataset

We’ll make a small synthetic dataset with some useful features and some noise.

Features:

  • X1: Strong correlation with target (important)

  • X2: Weak correlation (partially relevant)

  • X3, X4: Noise features (irrelevant)

Target (Y): A linear combination of X1 and X2 plus a little noise.

X1 X2 X3 X4 Y
1.0 2.0 -0.5 1.2 4.5
2.0 0.8 3.0 0.5 5.0
1.5 1.5 -1.0 2.3 5.5
2.2 1.0 0.2 3.1 6.8
3.0 2.5 -1.5 1.5 9.0

Step 2: Linear Regression (No Regularization)

A plain linear regression model minimizes the Mean Squared Error (MSE):

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2
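
To make the loss concrete, here is a minimal NumPy sketch of the MSE calculation. The y_hat values are hypothetical predictions, just to illustrate the arithmetic:

import numpy as np

# Targets from the toy dataset and hypothetical predictions (illustrative only)
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])
y_hat = np.array([4.3, 5.2, 5.6, 6.5, 8.8])

# MSE: average of the squared residuals
mse = np.mean((y - y_hat) ** 2)
print("MSE:", mse)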

What Happens

  • It fits weights to all features.

  • Even irrelevant ones get non-zero weights (overfitting risk).

Feature Weight
X1 2.5
X2 1.3
X3 0.8
X4 0.6

Observation: Noise features (X3, X4) are influencing predictions.


Step 3: L1 Regularization (Lasso)

L1 adds a penalty on the absolute value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum |w_i|
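
Continuing the NumPy sketch above, the L1 term simply adds λ times the sum of absolute weights. The weights w and strength lam below are illustrative, not fitted values:

# Illustrative weights and regularization strength (not fitted values)
w = np.array([2.5, 1.3, 0.8, 0.6])
lam = 0.1

# MSE plus the L1 penalty λ Σ|w_i|
l1_loss = mse + lam * np.sum(np.abs(w))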

Impact

  • Encourages sparsity: some weights become exactly zero.

  • Effectively performs feature selection.

Feature Weight
X1 2.4
X2 1.2
X3 0.0
X4 0.0

Observation: Irrelevant features are dropped completely.


Step 4: L2 Regularization (Ridge)

L2 adds a penalty on the squared value of weights:

Loss = \frac{1}{n} \sum (Y - \hat{Y})^2 + \lambda \sum w_i^2
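
The L2 version is the same idea with squared weights (reusing the illustrative mse, w, and lam from the sketches above):

# MSE plus the L2 penalty λ Σ w_i²
l2_loss = mse + lam * np.sum(w ** 2)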

Impact

  • Shrinks weights towards zero, but never fully removes them.

  • Reduces the influence of less important features.

Feature Weight
X1 2.2
X2 1.1
X3 0.3
X4 0.2

Observation: All features remain, but noise features have smaller weights.


Step 5: Side-by-Side Comparison

Aspect No Reg. L1 (Lasso) L2 (Ridge)
X1 Weight 2.5 2.4 2.2
X2 Weight 1.3 1.2 1.1
X3 Weight 0.8 0.0 0.3
X4 Weight 0.6 0.0 0.2
Overfitting Risk High Low Low
Feature Selection No Yes No

Step 6: Takeaways

  • No Regularization: Risks overfitting; all features get weights.

  • L1 (Lasso): Best when you want feature selection; creates sparse models.

  • L2 (Ridge): Best when all features matter but need their effects controlled.


Python Example

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import numpy as np

# Data
X = np.array([[1.0, 2.0, -0.5, 1.2],
              [2.0, 0.8, 3.0, 0.5],
              [1.5, 1.5, -1.0, 2.3],
              [2.2, 1.0, 0.2, 3.1],
              [3.0, 2.5, -1.5, 1.5]])
y = np.array([4.5, 5.0, 5.5, 6.8, 9.0])

# Train-Test Split (tiny dataset, so only 1 test sample; random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models
lr = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
ridge = Ridge(alpha=0.1).fit(X_train, y_train)

# Output Weights
print("Linear:", lr.coef_)
print("Lasso :", lasso.coef_)
print("Ridge :", ridge.coef_)

💡 In short:

  • Use Lasso if you want to automatically drop irrelevant features.

  • Use Ridge if you want to keep all features but control their influence.

  • Try Elastic Net (L1 + L2) if you want the best of both worlds.




L1 vs L2 Regularization — The Complete Guide (with Elastic Net)

When building machine learning models, it’s easy to fall into the overfitting trap — where your model learns noise instead of real patterns.
Regularization is one of the best ways to fight this.

Two of the most widely used regularization techniques are:

  • L1 Regularization (Lasso)

  • L2 Regularization (Ridge)

Both add a penalty term to the loss function, discouraging overly complex models. Let’s break them down.


1. L1 Regularization (Lasso)

Definition:
Adds the absolute value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} |w_i|

Where:

  • w_i = weight of the i-th feature

  • \lambda = regularization strength (higher = more penalty)

Key Characteristics:

  • Encourages sparsity (many weights become exactly zero)

  • Naturally performs feature selection

  • Works best when only a subset of features is truly relevant

When to Use:

  • High-dimensional datasets (e.g., text classification, genetics)

  • When you expect many features to be irrelevant

Example:
Predicting house prices with 100 features → L1 might keep only the 10 most important ones (e.g., square footage, location) and set the rest to zero.
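
As a rough sketch of this effect (with a synthetic dataset standing in for house prices, since we don't have the real one here), you can generate 100 features of which only 10 are informative and count how many coefficients Lasso keeps:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

# Synthetic stand-in: 100 features, only 10 actually informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))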


2. L2 Regularization (Ridge)

Definition:
Adds the squared value of the weights as a penalty term to the loss function:

Loss = Original\_Loss + \lambda \sum_{i} w_i^2

Key Characteristics:

  • Encourages small weights (closer to zero but not exactly zero)

  • Reduces the influence of any single feature without removing it entirely

  • Works best when all features are useful

When to Use:

  • You believe all features have some predictive power

  • You want to avoid overfitting but keep every feature in play

  • Useful for correlated features

Example:
Predicting house prices → All features (square footage, bedrooms, bathrooms, etc.) contribute, but L2 ensures no single one dominates.
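
A small hedged sketch of the same idea: with two nearly identical (highly correlated) features, ordinary least squares can hand them large offsetting weights, while Ridge keeps both weights small and stable. The data is synthetic and purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # near-duplicate of x1 (highly correlated)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print("OLS  :", LinearRegression().fit(X, y).coef_)
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)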


3. Side-by-Side: L1 vs L2

Aspect L1 (Lasso) L2 (Ridge)
Penalty Term \lambda \sum |w_i| \lambda \sum w_i^2
Effect on Weights Many become exactly zero All become small, non-zero
Feature Selection ✅ Yes ❌ No
Optimization Harder (non-differentiable at zero) Easier (fully differentiable)
Best For Sparse models, irrelevant features Regularizing all features

4. Elastic Net — The Best of Both Worlds

Elastic Net combines L1 and L2 penalties:

Loss = Original\_Loss + \alpha \lambda \sum |w_i| + (1-\alpha) \lambda \sum w_i^2

Why use it?

  • Retains the feature selection benefits of L1

  • Keeps the weight shrinkage benefits of L2

  • Especially helpful when features are correlated
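
To tie the Elastic Net formula above to code, here is a tiny NumPy sketch of just the combined penalty term. The weights w, strength lam, and mixing ratio alpha are illustrative values, not outputs of a fitted model:

import numpy as np

w = np.array([2.5, 1.3, 0.8, 0.6])  # illustrative weights
lam, alpha = 0.1, 0.5               # illustrative λ and L1/L2 mixing ratio

# α·λ·Σ|w_i| + (1-α)·λ·Σ w_i²
penalty = alpha * lam * np.sum(np.abs(w)) + (1 - alpha) * lam * np.sum(w ** 2)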


5. Visual Intuition

  • L1 (Lasso): Diamond-shaped constraint → optimization often lands on corners → many weights exactly zero (sparse solution)

  • L2 (Ridge): Circular constraint → optimization lands inside → all weights small, none zero


6. Choosing the Right Regularization

Use L1 when:

  • You want a sparse model

  • You expect many irrelevant features

  • You need automatic feature selection

Use L2 when:

  • All features likely matter

  • You want to control coefficient size without removing features

  • You have multicollinearity (correlated features)

Use Elastic Net when:

  • You want a mix of sparsity + stability

  • You have many correlated features

  • You want to avoid L1’s instability on correlated data


7. Python Implementation

from sklearn.linear_model import Lasso, Ridge, ElasticNet

# X_train, y_train are assumed to be defined (e.g., from the train/test split in the earlier example)

# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha = λ (regularization strength)
lasso.fit(X_train, y_train)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio balances L1/L2
elastic_net.fit(X_train, y_train)
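
As in the earlier example, the fitted coefficients can then be inspected via lasso.coef_, ridge.coef_, and elastic_net.coef_ to see which weights were zeroed, shrunk, or a mix of both. Note that sklearn's l1_ratio plays roughly the role of the mixing parameter α in the Elastic Net formula above.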

8. Summary Table

Regularization Main Effect Removes Features? Best For
L1 Sparse weights (zeros) ✅ Yes High-dimensional, irrelevant features
L2 Small, non-zero weights ❌ No All features relevant, control magnitude
Elastic Net Mix of L1 & L2 benefits Partial Correlated features + feature selection

💡 Takeaway:

  • Use L1 for feature selection

  • Use L2 for controlling weight magnitude

  • Use Elastic Net for a balanced approach