🧠 From Customers to Embeddings (Clustering): Building a Deep Learning Lookalike Engine

A blog post on building production-grade customer embeddings using MLP + K-Means + PCA/t-SNE evaluation.

🎯 The Problem

A retail bank needed to identify lookalike prospects — people who behave like their best existing customers, but aren't customers yet.

Traditional approach: Manually pick a few features (income, age, credit score) and match prospects.

Problem: Too simplistic. Real customer behavior has hundreds of subtle signals — spending patterns, transaction frequency, digital engagement, life stage — that simple matching misses.

Goal: Build a system that learns a rich, compact representation of each customer's behavior, then uses it for prospect targeting and segmentation at scale.

🏗️ The Solution: Deep Learning Embeddings

Rather than stopping at a yes/no prediction, train a neural network on conversion and extract a 32-dimensional fingerprint per customer from its last hidden layer. Use those embeddings for multiple downstream tasks.

┌──────────────────────────────────────────────────────────┐
│                                                          │
│   Raw features  →  [Neural Net]  →  32-dim vector        │
│                                          ↓               │
│                            Used for: prediction,         │
│                            segmentation, similarity      │
│                                                          │
└──────────────────────────────────────────────────────────┘

🧠 The Architecture: MLP (Multi-Layer Perceptron)

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  INPUT LAYER       HIDDEN LAYERS     EMBEDDING   OUTPUT  │
│  (Features)        (Learn patterns)  (32-dim)    (0/1)   │
│                                                          │
│       ●─┐                                                │
│       ●─┼──→ ●●●●● ──→ ●●●● ──→ [32 nums] ──→  Yes/No   │
│       ●─┤    ●●●●●     ●●●●         ↑                    │
│       ●─┘                            │                   │
│                                  This is what            │
│                                  we extract              │
│                                                          │
└──────────────────────────────────────────────────────────┘

What Each Layer Represents

Layer	What It Does
Input	Raw features (numerical + encoded categoricals)
Hidden 1	Learns basic feature interactions
Hidden 2	Learns higher-order behavior patterns
Embedding	Compressed "behavior signature" ⭐
Output	Binary classification (converts or not)
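The layer stack in the table can be sketched as a plain forward pass. This is a dependency-free illustration, not the production model: the layer widths other than 32, the ReLU activations, and the random weights are all assumptions made for the example. The point it shows is that one pass yields both the prediction and the embedding.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu=True):
    """Affine transform followed by an optional ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def forward(x, layers):
    """Return (embedding, conversion probability) for one feature vector."""
    h = x
    for layer in layers[:-1]:          # hidden 1, hidden 2, embedding
        h = dense(h, layer)
    embedding = h                      # the 32 numbers we extract
    logit = dense(embedding, layers[-1], relu=False)[0]
    return embedding, 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(42)
sizes = [20, 128, 64, 32, 1]           # hypothetical widths; only 32 comes from the post
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
features = [rng.random() for _ in range(sizes[0])]
embedding, probability = forward(features, layers)
print(len(embedding), round(probability, 3))
```

In a real framework the same idea is usually implemented by returning the activations of the layer just before the output head.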

Why MLP?

✅ Produces reusable embeddings (XGBoost gives predictions, not representations)
✅ Captures non-linear interactions between features
✅ Embeddings work for multiple downstream tasks

🎯 Training Strategy

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  1. LOSS FUNCTION:  Binary Cross-Entropy                 │
│     → Standard for binary classification                 │
│                                                          │
│  2. CLASS IMBALANCE: Weighted sampling                   │
│     → Conversion events are rare                         │
│                                                          │
│  3. DATA SPLITS:    Train + 2 OOT (Out-of-Time)          │
│     → Validates model holds up next month                │
│                                                          │
│  4. OPTIMIZER:      Adam (adaptive learning rate)        │
│     → Works well for MLPs                                │
│                                                          │
│  5. REGULARIZATION: Dropout + Early Stopping             │
│     → Prevents overfitting                               │
│                                                          │
└──────────────────────────────────────────────────────────┘
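Two of the items above, the loss and the class-imbalance handling, are easy to show in a few lines. The sketch below assumes inverse-class-frequency weights and sampling with replacement, which is one common way to do weighted sampling; the post does not say which scheme was used, and the 2% conversion rate is toy data.

```python
import math
import random
from collections import Counter

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def inverse_frequency_weights(labels):
    """Per-row sampling weight = 1 / (size of that row's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def sample_batch(labels, batch_size, rng):
    """Draw row indices with replacement so both classes are equally likely."""
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)

labels = [1] * 20 + [0] * 980          # toy data with a 2% conversion rate
rng = random.Random(0)
batch = sample_batch(labels, 2000, rng)
positive_share = sum(labels[i] for i in batch) / len(batch)
print(f"positives in batch: {positive_share:.0%}")   # close to 50% instead of 2%
```

Because rare converters are repeated many times in the resampled batches, dropout and early stopping on the OOT data matter even more with this scheme.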

📊 Evaluation: Three Lenses

The most important part — and what most people skip!

🎯 Lens 1: Predictive Performance

Standard classification metrics on Train + Out-of-Time data:

AUC → ranking power (probability that a random converter is scored above a random non-converter)
KS → maximum gap between the cumulative score distributions of the two classes

✅ Train ≈ OOT = little sign of overfitting
✅ AUC stable across time = generalizes well
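Both metrics can be computed directly from labels and scores. The following is a small reference sketch on toy data, written for clarity rather than speed (the pairwise AUC is quadratic); in practice a library routine would be used on the train and OOT sets.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(scores, y_true))
    seen_pos = seen_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every row that shares this score before measuring the gap
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                seen_pos += 1
            else:
                seen_neg += 1
            j += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
        i = j
    return best

labels = [0, 0, 1, 1]                  # toy labels and scores
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))             # 0.75
print(ks_statistic(labels, scores))    # 0.5
```

Running the same two functions on the training window and on each OOT window gives the numbers to compare for stability.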

🎯 Lens 2: Embedding Quality via Clustering

"Did the network learn MEANINGFUL representations or just noise?"

If embeddings are good, K-Means should find natural clusters.

cluster_evaluation.py

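from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# embeddings: Spark DataFrame with a "features" vector column
evaluator = ClusteringEvaluator()  # silhouette (squared Euclidean) by default

# Try K = 2 to 14
for k in range(2, 15):
    model = KMeans(k=k, seed=42, maxIter=20).fit(embeddings)

    inertia = model.summary.trainingCost                          # Lower = tighter
    silhouette = evaluator.evaluate(model.transform(embeddings))  # Higher = better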

Two metrics, side-by-side:

   Inertia (Elbow)         Silhouette Score
       │●                      │      ●●●  ← PEAK at K=6
       │ ●                     │     ●   ●
       │  ●●                   │    ●     ●
       │     ●●                │   ●       ●●
       │        ●●●● ← ELBOW   │  ●           ●
       │             ●●●●●     │ ●              ●●
       └──────────────── K     └─────────────────── K
       2  4  6  8  10          2  4  6  8  10

   Both point to K=6 → strong confidence ✅
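Reading the two curves can also be automated once the per-K metrics are collected. The sketch below uses the largest second difference of inertia as the elbow, which is one common heuristic and an assumption here, and it expects evenly spaced K values. The numbers are made up to mimic the shape of the charts above; they are not results from the project.

```python
def elbow_k(ks, inertias):
    """K with the largest second difference of inertia (sharpest bend)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

def silhouette_k(ks, silhouettes):
    """K with the highest silhouette score."""
    return max(zip(ks, silhouettes), key=lambda pair: pair[1])[0]

# Hypothetical metrics for K = 2..8, shaped like the plots above
ks = [2, 3, 4, 5, 6, 7, 8]
inertias = [1000, 800, 640, 520, 430, 410, 395]
silhouettes = [0.18, 0.21, 0.24, 0.27, 0.31, 0.26, 0.22]

chosen = {"elbow": elbow_k(ks, inertias), "silhouette": silhouette_k(ks, silhouettes)}
print(chosen)   # {'elbow': 6, 'silhouette': 6}
```

When the two methods disagree, the plots themselves are worth a second look before committing to a K.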

🎯 Lens 3: Visualization

Can't trust numbers alone — see the clusters with your eyes.

visualization.py

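from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Sample for visualization (can't plot all rows); assumes a pandas DataFrame
sample = embeddings.sample(50_000, random_state=42)

# PCA → global structure (linear, fast)
pca_2d = PCA(n_components=2).fit_transform(sample)
# t-SNE → local neighborhoods (non-linear, slow)
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(sample)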

Why both?

PCA shows the big picture (overall variance)
t-SNE shows fine-grained neighborhoods

Together they provide a visual sanity check that the clusters reflect real structure rather than artifacts of a single projection.

🎯 Cheat Sheet: One-Liner Recall

Memorize these for any future clustering question:

Concept	One-Line Recall
K-Means	"Pick K random centroids, assign points to nearest centroid, move centroids to mean of their points, repeat until stable."
Elbow Method	"Plot inertia vs K — pick the K where adding more clusters stops giving big improvements (the bend)."
Silhouette Score	"How close a point is to its OWN cluster vs the NEAREST OTHER cluster — ranges -1 to +1, higher means better-separated clusters."
PCA	"Linear projection that compresses data into directions of maximum variance — shows GLOBAL structure."
t-SNE	"Non-linear projection that preserves local neighborhoods — shows fine-grained LOCAL structure for visualization."
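The K-Means and Silhouette one-liners translate almost word for word into code. This is a teaching sketch on toy 2-D blobs, not a replacement for a library implementation. The ten random restarts (keeping the lowest inertia) are an addition not mentioned in the recall, included because a single random start can settle on a poor split.

```python
import math
import random

def kmeans(points, k, rng, max_iter=20):
    """One run of Lloyd's algorithm from random starting centroids."""
    centroids = rng.sample(points, k)           # pick K random points as centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]          # assign each point to its nearest centroid
        if new_labels == labels:                # assignments stable: stop
            break
        labels = new_labels
        for c in range(k):                      # move centroids to the cluster means
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                         # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b): a = own-cluster distance, b = nearest other cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:                       # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

rng = random.Random(7)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]
points = blob_a + blob_b

# Several random restarts; keep the run with the lowest inertia
labels, centroids, inertia = min((kmeans(points, 2, rng) for _ in range(10)),
                                 key=lambda run: run[2])
score = silhouette(points, labels)
print(round(score, 2))
```

On well-separated toy blobs like these the score lands in the "toy data" range, which is why real embeddings should not be held to the same bar.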

📊 Quick Acceptance Thresholds

Metric	✅ Accept
Silhouette (toy data)	> 0.7
Silhouette (business data)	> 0.25
Silhouette (embeddings)	> 0.15
Elbow	Clear bend visible
PCA explained variance	First 2 PCs > 40%

📉 Production Monitoring: PSI

Once deployed, track the Population Stability Index (PSI) of each embedding dimension monthly:

PSI < 0.10    →  ✅ Stable
PSI 0.10-0.25 →  ⚠️ Slight drift, monitor
PSI > 0.25    →  🚨 Retrain!

If certain dimensions drift heavily, the model's understanding of customer behavior has shifted — time to retrain.
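For a single embedding dimension, PSI compares the share of customers per bin between a baseline month and the current month. The sketch below assumes ten bins cut at the baseline deciles and a small floor on empty bins; both are common choices but not stated in the post. The data is simulated.

```python
import math
import random

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index for one dimension, with bins cut on the baseline."""
    ordered = sorted(baseline)
    # Interior bin edges at the baseline quantiles
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]

    expected, actual = bin_shares(baseline), bin_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

def status(value):
    """Map a PSI value to the thresholds listed above."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "monitor"
    return "retrain"

rng = random.Random(1)
baseline = [rng.gauss(0, 1) for _ in range(5000)]        # dimension at training time
same_month = [rng.gauss(0, 1) for _ in range(5000)]      # no drift
shifted_month = [rng.gauss(1, 1) for _ in range(5000)]   # mean shifted by one std dev

for name, month in [("same", same_month), ("shifted", shifted_month)]:
    value = psi(baseline, month)
    print(name, round(value, 3), status(value))
```

Applied in production, the same function would run once per embedding dimension, so a 32-dimensional embedding yields 32 PSI values per month.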

🎯 The Complete Pipeline

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  1. TRAIN MLP                                            │
│     → Binary cross-entropy loss                          │
│     → Extract embedding layer                            │
│                  ↓                                       │
│  2. EVALUATE PREDICTIONS                                 │
│     → AUC, KS on train + OOT data                        │
│                  ↓                                       │
│  3. VALIDATE EMBEDDINGS                                  │
│     → K-Means (K=2 to 14)                                │
│     → Elbow + Silhouette → pick optimal K                │
│                  ↓                                       │
│  4. VISUALIZE                                            │
│     → Sample points, project to 2D                       │
│     → PCA (global) + t-SNE (local)                       │
│                  ↓                                       │
│  5. DEPLOY + MONITOR                                     │
│     → PSI per embedding dimension                        │
│     → Retrain if drift detected                          │
│                                                          │
└──────────────────────────────────────────────────────────┘

💡 Why This Approach Works

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  TRADITIONAL ML:                                         │
│  Features → Model → Prediction                           │
│  (Predictions only, single use)                          │
│                                                          │
│  EMBEDDING APPROACH:                                     │
│  Features → Model → EMBEDDING → Many uses                │
│                       ↓                                  │
│                  • Prediction                            │
│                  • Segmentation                          │
│                  • Lookalike matching                    │
│                  • Transfer to other models              │
│                                                          │
│  ONE MODEL → MULTIPLE BUSINESS APPLICATIONS              │
│                                                          │
└──────────────────────────────────────────────────────────┘
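Lookalike matching, the use case that motivated the project, reduces to a nearest-neighbor question in embedding space. The post does not specify the matching rule, so the sketch below assumes one simple option: average the embeddings of the best customers into a seed vector and rank prospects by cosine similarity to it. The 3-dimensional vectors and prospect ids are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def rank_lookalikes(best_customers, prospects, top_n):
    """Return prospect ids ordered by similarity to the centroid of the best customers."""
    seed = centroid(best_customers)
    scored = sorted(prospects.items(),
                    key=lambda item: cosine_similarity(item[1], seed),
                    reverse=True)
    return [prospect_id for prospect_id, _ in scored[:top_n]]

best_customers = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]      # toy 3-dim embeddings
prospects = {
    "p1": [0.85, 0.15, 0.05],
    "p2": [0.0, 0.1, 0.9],
    "p3": [0.5, 0.5, 0.1],
}
top = rank_lookalikes(best_customers, prospects, top_n=2)
print(top)   # ['p1', 'p3']
```

With several distinct segments of best customers, one seed per K-Means cluster would likely work better than a single global centroid.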

🎯 Key Takeaways

✅ MLP produces reusable embeddings, not just predictions
✅ Embedding quality validated through 3 lenses:
   AUC (prediction) + Silhouette (structure) + Visualization (sanity)
✅ Pick K with Elbow + Silhouette together
✅ Real embeddings have lower silhouette than toy data
✅ PSI monitors each dimension for production drift
✅ The embedding becomes the foundation for many use cases

📝 The 30-Second Pitch (For Interviews)

"I built a customer embedding pipeline using a Multi-Layer Perceptron trained on conversion prediction. The model produces a compact 32-dimensional vector per customer that captures rich behavioral patterns. I validated embedding quality through three lenses: predictive performance (AUC + KS), cluster structure via K-Means with Elbow + Silhouette, and visualization via PCA and t-SNE. PSI monitors each embedding dimension in production to catch drift early. The same embeddings power multiple downstream uses: prediction, segmentation, and lookalike modeling."

💡 The real insight: Most ML projects produce predictions and stop. By extracting embeddings, one model powers many business applications — that's the difference between a model and a platform. 🚀

Bigdata and data science by Kartheek Dachepalli

Tuesday, June 23, 2026