🚀 Automated Feature Selection at Scale: A Practical Blueprint

A deep dive into how modern AutoML pipelines turn thousands of raw features into 10–20 of the strongest — with the inner workings of variable clustering finally explained clearly.

🎯 The Problem

When you start a real ML project, you often have thousands of features waiting in feature stores.

Picking the right 10–20 manually means:

Hours of EDA per feature
Subjective domain bias
Missed gold features hiding in low-importance columns
Multicollinearity that quietly tanks your model

An automated feature selection pipeline solves this — turning 5,000+ features into a curated, stable, non-redundant set ready for XGBoost.

🏗️ The 7-Step Pipeline

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  STEP 1 → Build EDA dataset (sample across 3 periods)    │
│  STEP 2 → Random Forest importance (first pass)          │
│  STEP 3 → Random Forest importance (second pass)         │
│  STEP 4 → Compute IV / PSI / CSI in parallel             │
│  STEP 5 → Apply stability + predictive filters           │
│  STEP 6 → Variable clustering (remove redundancy)        │
│  STEP 7 → Pick top 2 features per cluster                │
│                                                          │
└──────────────────────────────────────────────────────────┘

📋 Step-by-Step Walkthrough

Step 1: Build the Combined Dataset

Sample ~11% of the data from the training period and from each of 2 Out-of-Time (OOT) snapshots, then union the three samples.

Why? Every later step then sees all three periods, so a feature has to be stable across time to survive, not just lucky on the training data.
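
A minimal pandas sketch of this step, assuming the three periods are already available as in-memory DataFrames with identical columns. The frame names, the `period` tag column, and the helper `build_eda_dataset` are hypothetical, and synthetic data stands in for a real feature store.

```python
import numpy as np
import pandas as pd

def build_eda_dataset(frames, frac=0.11, seed=42):
    """Sample `frac` of each period and union the samples, tagging rows with their period."""
    parts = []
    for name, df in frames.items():
        sample = df.sample(frac=frac, random_state=seed)
        parts.append(sample.assign(period=name))
    return pd.concat(parts, ignore_index=True)

rng = np.random.default_rng(0)

def make_period(n):
    # Synthetic stand-in for one snapshot of the feature store.
    return pd.DataFrame({
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
        "target": rng.integers(0, 2, size=n),
    })

frames = {"train": make_period(1000), "oot1": make_period(1000), "oot2": make_period(1000)}
eda = build_eda_dataset(frames)
print(eda["period"].value_counts())
```

Keeping the `period` tag on every row is a design choice that lets the later stability metrics split the union back into train and OOT slices.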

Steps 2 & 3: Two Passes of Random Forest Importance

Build a Random Forest (~10,000 trees) and capture per-feature importance.

First pass  → drop features with zero importance
Second pass → re-rank surviving features for reliability

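Why two passes? Importance scores from a single forest are noisy, because every tree sees a random bootstrap sample and random feature subsets. Refitting on the survivors gives a cleaner, more reliable ranking.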
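
The two passes could look roughly like this with scikit-learn. This is a sketch under assumptions: it uses impurity-based `feature_importances_`, a synthetic dataset, a hypothetical helper `rf_importance`, and only 200 trees so that it runs quickly (the pipeline described here uses far more).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def rf_importance(X, y, n_trees=200, seed=0):
    """Fit a Random Forest and return impurity-based importances indexed by feature name."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns)

X_arr, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])
X["constant"] = 1.0  # a dead column that no tree can split on

# First pass: drop features the forest never used.
first = rf_importance(X, y, seed=0)
survivors = first[first > 0].index

# Second pass: refit on the survivors with a different seed and re-rank.
second = rf_importance(X[survivors], y, seed=1)
ranking = second.sort_values(ascending=False)
print(ranking.head())
```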

Step 4: Univariate + Bivariate Stats (Parallel)

For each surviving feature, compute these metrics in parallel batches:

Metric	One-Liner
IV (Information Value)	How much this feature predicts the target
PSI	Has this feature's distribution drifted over time?
CSI	Has this feature's relationship to target drifted?
WoE	Per-bucket signal — reused later for clustering
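
As a rough illustration, WoE/IV and PSI can be computed with NumPy and pandas as below. The sketch assumes a binary target (1 = event), quantile binning into 10 buckets, and the WoE convention ln(% non-events / % events); the function names are hypothetical. CSI is left out because its exact definition in this pipeline is not spelled out.

```python
import numpy as np
import pandas as pd

def woe_iv(feature, target, bins=10, eps=1e-6):
    """Return per-bucket WoE and total IV for a binary target (1 = event)."""
    buckets = pd.qcut(np.asarray(feature), q=bins, duplicates="drop")
    frame = pd.DataFrame({"bucket": buckets, "y": np.asarray(target)})
    grouped = frame.groupby("bucket", observed=True)["y"]
    events = grouped.sum()
    non_events = grouped.count() - events
    # Clip to avoid log(0) in buckets that contain only one class.
    pct_event = (events / events.sum()).clip(lower=eps)
    pct_non_event = (non_events / non_events.sum()).clip(lower=eps)
    woe = np.log(pct_non_event / pct_event)
    iv = float(((pct_non_event - pct_event) * woe).sum())
    return woe, iv

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI of `actual` against `expected`, using quantile bins taken from `expected`."""
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    p = np.bincount(np.searchsorted(inner_edges, expected), minlength=bins) / len(expected)
    q = np.bincount(np.searchsorted(inner_edges, actual), minlength=bins) / len(actual)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
x_train = rng.normal(size=5000)
y_train = (rng.random(5000) < 1 / (1 + np.exp(-x_train))).astype(int)
woe, iv = woe_iv(x_train, y_train)

x_oot_stable = rng.normal(size=5000)            # same distribution as train
x_oot_shifted = rng.normal(loc=0.5, size=5000)  # mean has drifted
print(iv, psi(x_train, x_oot_stable), psi(x_train, x_oot_shifted))
```

Because each feature is scored independently, calls like these can be spread across workers in batches.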

Step 5: Hard Filters

Apply strict cutoffs:

SQL

PSI_OOT1 < 0.1 AND PSI_OOT1 > 0   -- Stable, but not constant
AND PSI_OOT2 < 0.1 AND PSI_OOT2 > 0
AND CSI_OOT1 < 0.1 AND CSI_OOT2 < 0.1
AND IV < 5.0 AND IV > 0           -- Predictive, not leakage
AND Importance > 0.0              -- RF confirmed useful

Filter	Why
`PSI > 0`	Avoid dead/constant features
`PSI < 0.1`	Distribution stable across time
`CSI < 0.1`	Target relationship stable
`IV > 0`	Has predictive signal
`IV < 5.0`	Not target leakage

Note: Using 0.1 (stricter) instead of the industry "alert" threshold of 0.25 is intentional — we're selecting features for a brand-new model, not just monitoring an existing one.
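
The same cutoffs can be applied to a per-feature metrics table in pandas. The table below is made up for illustration: the feature names and values are hypothetical, and only the column names mirror the filter above.

```python
import pandas as pd

# Hypothetical metrics, one row per feature.
metrics = pd.DataFrame(
    {
        "PSI_OOT1": [0.02, 0.00, 0.30, 0.05],
        "PSI_OOT2": [0.03, 0.00, 0.04, 0.06],
        "CSI_OOT1": [0.01, 0.00, 0.02, 0.05],
        "CSI_OOT2": [0.02, 0.00, 0.03, 0.04],
        "IV": [0.35, 0.00, 0.40, 7.20],
        "Importance": [0.012, 0.0, 0.020, 0.150],
    },
    index=["income", "dead_flag", "drifting_score", "leaky_field"],
)

psi_cols = ["PSI_OOT1", "PSI_OOT2"]
csi_cols = ["CSI_OOT1", "CSI_OOT2"]

# Stable: PSI strictly between 0 and 0.1, CSI below 0.1, in both OOT periods.
stable = (
    metrics[psi_cols].gt(0).all(axis=1)
    & metrics[psi_cols + csi_cols].lt(0.1).all(axis=1)
)
# Predictive: IV strictly between 0 and 5.0, and non-zero forest importance.
predictive = metrics["IV"].gt(0) & metrics["IV"].lt(5.0) & metrics["Importance"].gt(0)

selected = metrics[stable & predictive]
print(selected.index.tolist())  # ['income']
```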

Step 6: Variable Clustering — The Star of the Show 🌳

Even after Steps 1-5, your surviving features still have a hidden problem: multicollinearity.

Many features are correlated and carry redundant signal. We need to group them and pick the best from each group.

The industry-standard tool here is hierarchical variable clustering, popularized by SAS's PROC VARCLUS; VarClusHi is a Python implementation of the same idea.

🧠 Deep Dive: How Variable Clustering Actually Works

This algorithm clusters features (columns) — not data points (rows). This is the opposite of K-Means, which clusters rows.

The Mental Flip

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  Normal view:  Each ROW = customer (data point)          │
│                Each COLUMN = feature                     │
│                                                          │
│  Here:         Each FEATURE = vector of values across    │
│                all customers                             │
│                                                          │
│  Two features are "similar" if their values move         │
│  together across customers (correlation).                │
│                                                          │
└──────────────────────────────────────────────────────────┘
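
To make the flip concrete, here is a small NumPy sketch on synthetic data (the feature names and noise levels are assumptions). Each feature is a vector across customers, and similarity is the correlation between those vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers = 1000
wealth = rng.normal(size=n_customers)  # hidden signal shared by two features

income = wealth + 0.3 * rng.normal(size=n_customers)
savings = wealth + 0.3 * rng.normal(size=n_customers)
age = rng.normal(size=n_customers)     # unrelated to wealth

X = np.column_stack([income, savings, age])  # rows = customers, columns = features
corr = np.corrcoef(X, rowvar=False)          # 3 x 3: one entry per pair of FEATURES
print(np.round(corr, 2))
```

The matrix has one row and column per feature, not per customer, and that feature-by-feature matrix is what variable clustering works on.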

🎯 The Two PCA Ingredients Variable Clustering Uses

Before splitting a cluster, the algorithm runs PCA on it. PCA gives two things:

1️⃣ Eigenvalues — "How LOUD is each signal?"

Imagine running PCA on a cluster of 5 features. You get something like:

PC1: λ₁ = 2.8   ← Main signal (strong)
PC2: λ₂ = 1.5   ← Second signal (also strong!)
PC3: λ₃ = 0.3   ← Noise
PC4: λ₄ = 0.05  ← Noise
PC5: λ₅ = 0.02  ← Noise

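Eigenvalue = volume knob. High λ means an important signal. Low λ means noise. (The numbers here are illustrative: on standardized features the eigenvalues of the correlation matrix always sum to the number of features, so a real 5-feature cluster would total exactly 5.)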

Visual:

λ value
   │
3.0│ ●  ← λ₁ = 2.8 (very loud)
   │
2.0│
   │      ●  ← λ₂ = 1.5 (also loud!)
1.0│
   │
0.5│            ●  ← λ₃ = 0.3 (quiet)
   │               ●  ← λ₄ = 0.05
0.0│                  ●  ← λ₅ = 0.02
   └──────────────────── PC index
       1   2   3   4   5

When λ₂ stays loud (> 0.7), there are at least TWO real signals, so it is time to split!
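
The eigenvalue check can be sketched with NumPy on synthetic data. The five features below are generated from two independent hidden signals; the data, the noise level, and the 0.7 threshold variable are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
wealth, maturity = rng.normal(size=n), rng.normal(size=n)

X = np.column_stack([
    wealth + 0.4 * rng.normal(size=n),    # Income
    wealth + 0.4 * rng.normal(size=n),    # Savings
    wealth + 0.4 * rng.normal(size=n),    # NetWorth
    maturity + 0.4 * rng.normal(size=n),  # Age
    maturity + 0.4 * rng.normal(size=n),  # Tenure
])

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order

max_eigen = 0.7
print(np.round(eigenvalues, 2))
print("sum of eigenvalues:", round(float(eigenvalues.sum()), 6))  # equals the feature count
print("split?", bool(eigenvalues[1] > max_eigen))
```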

2️⃣ Loadings — "Which feature contributes to which signal?"

Loadings form a matrix that shows how strongly each feature aligns with each PC:

                  PC1        PC2
                (Wealth)  (Maturity)
   Income        0.85       0.10
   Savings       0.80       0.15
   NetWorth      0.78       0.05
   Age           0.12       0.88
   Tenure        0.08       0.85

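Loading = magnetism. Bigger absolute loading = that feature fits that signal better. (With raw PCA, correlated signals can blend across PC1 and PC2, so SAS's PROC VARCLUS rotates the two components before comparing them; the assignment rule stays the same.)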
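
One way to obtain such a loadings matrix is to scale each eigenvector by the square root of its eigenvalue, which gives the correlation between every feature and every PC. The sketch below assumes synthetic data with two independent hidden signals, so unrotated PCA already separates them; signs may come out flipped, which is why only absolute values are compared.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
wealth, maturity = rng.normal(size=n), rng.normal(size=n)
names = ["Income", "Savings", "NetWorth", "Age", "Tenure"]
X = np.column_stack(
    [wealth + 0.4 * rng.normal(size=n) for _ in range(3)]
    + [maturity + 0.4 * rng.normal(size=n) for _ in range(2)]
)

corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
order = np.argsort(eigenvalues)[::-1][:2]         # indices of the two largest

# Loading = eigenvector entry * sqrt(eigenvalue) = corr(feature, PC).
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

for name, (pc1, pc2) in zip(names, loadings):
    winner = "PC1" if abs(pc1) > abs(pc2) else "PC2"
    print(f"{name:9s} PC1={pc1:+.2f}  PC2={pc2:+.2f}  -> {winner}")
```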

🌳 The Algorithm

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  STEP A: Start with ALL features in ONE cluster          │
│                                                          │
│  STEP B: Run PCA → get eigenvalues + loadings            │
│                                                          │
│  STEP C: Check the 2nd eigenvalue (λ₂):                  │
│                                                          │
│           IF λ₂ > 0.7                                    │
│             → "Multiple signals hiding in this cluster"  │
│             → SPLIT into 2 sub-clusters                  │
│                                                          │
│           IF λ₂ ≤ 0.7                                    │
│             → "Only one signal here"                     │
│             → STOP. This is a final cluster              │
│                                                          │
│  STEP D: To split — each feature joins whichever PC it   │
│           has the HIGHER absolute loading on:            │
│                                                          │
│           Feature → joins PC1 cluster if                 │
│                     |loading_PC1| > |loading_PC2|        │
│                     else joins PC2 cluster               │
│                                                          │
│  STEP E: Recursively repeat on each new sub-cluster      │
│                                                          │
└──────────────────────────────────────────────────────────┘
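
Steps A to E can be sketched as a short recursive function. This is a literal, simplified reading of the box above using unrotated PCA loadings, not a reimplementation of PROC VARCLUS or VarClusHi; the function names, the synthetic data, and the guard for a one-sided split are assumptions.

```python
import numpy as np

def top_two_components(corr):
    """Return the second-largest eigenvalue and the loadings on the two largest PCs."""
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1][:2]
    loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
    return eigenvalues[order[1]], loadings

def variable_clusters(X, names, max_eigen=0.7):
    """Recursively split feature clusters while the 2nd eigenvalue exceeds `max_eigen`."""
    if len(names) < 2:
        return [list(names)]
    corr = np.corrcoef(X, rowvar=False)               # STEP B
    lambda2, loadings = top_two_components(corr)
    if lambda2 <= max_eigen:                          # STEP C: only one signal
        return [list(names)]
    to_pc1 = np.abs(loadings[:, 0]) > np.abs(loadings[:, 1])  # STEP D
    if to_pc1.all() or not to_pc1.any():
        return [list(names)]                          # one-sided split: stop here
    clusters = []
    for mask in (to_pc1, ~to_pc1):                    # STEP E: recurse on both halves
        sub_names = [name for name, keep in zip(names, mask) if keep]
        clusters += variable_clusters(X[:, mask], sub_names, max_eigen)
    return clusters

rng = np.random.default_rng(1)
n = 2000
wealth, maturity = rng.normal(size=n), rng.normal(size=n)
names = ["Income", "Savings", "NetWorth", "Age", "Tenure"]
X = np.column_stack(
    [wealth + 0.4 * rng.normal(size=n) for _ in range(3)]
    + [maturity + 0.4 * rng.normal(size=n) for _ in range(2)]
)
print(variable_clusters(X, names))  # STEP A: start with everything in one cluster
```

The one-sided-split guard is a limitation of comparing raw loadings: two nearly uncorrelated features can both lean on PC1 and stay together, which a production implementation would need to handle.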

🎨 Worked Example

Start: 5 features in one cluster

{Income, Savings, NetWorth, Age, Tenure}

Run PCA on this cluster:

λ₁ = 2.8, λ₂ = 1.5, λ₃ = 0.3, λ₄ = 0.05, λ₅ = 0.02

Check: λ₂ = 1.5 > 0.7  →  SPLIT!

Use loadings to assign features to sub-clusters:

Income     PC1=0.85, PC2=0.10  →  |0.85| > |0.10| → PC1 cluster ✅
Savings    PC1=0.80, PC2=0.15  →  |0.80| > |0.15| → PC1 cluster ✅
NetWorth   PC1=0.78, PC2=0.05  →  |0.78| > |0.05| → PC1 cluster ✅
Age        PC1=0.12, PC2=0.88  →  |0.12| < |0.88| → PC2 cluster ✅
Tenure     PC1=0.08, PC2=0.85  →  |0.08| < |0.85| → PC2 cluster ✅

Result after one split:

Cluster 1 (Wealth signal):    {Income, Savings, NetWorth}
Cluster 2 (Maturity signal):  {Age, Tenure}
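
The assignment rule is small enough to check by hand, or with a few lines of Python applied to the loadings from this worked example (the dictionary layout is just one convenient way to hold them):

```python
# Loadings copied from the worked example: (PC1, PC2) per feature.
loadings = {
    "Income": (0.85, 0.10),
    "Savings": (0.80, 0.15),
    "NetWorth": (0.78, 0.05),
    "Age": (0.12, 0.88),
    "Tenure": (0.08, 0.85),
}

clusters = {"PC1": [], "PC2": []}
for feature, (pc1, pc2) in loadings.items():
    # Each feature joins the PC on which it has the larger absolute loading.
    target = "PC1" if abs(pc1) > abs(pc2) else "PC2"
    clusters[target].append(feature)

print(clusters)
# {'PC1': ['Income', 'Savings', 'NetWorth'], 'PC2': ['Age', 'Tenure']}
```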

Recursively check each new cluster:

Cluster 1: PCA → λ₂ = 0.2 → STOP (final cluster)
Cluster 2: PCA → λ₂ = 0.1 → STOP (final cluster)

Final hierarchy:

              ALL FEATURES
                  │
               [Split λ₂=1.5]
                  │
         ┌────────┴─────────┐
         │                  │
   {Income, Savings,    {Age, Tenure}
    NetWorth}
         │                  │
   [Final cluster]    [Final cluster]

🎯 Why Variable Clustering Over K-Means?

Aspect	K-Means	Variable Clustering
Clusters	Rows (customers)	Columns (features)
Distance metric	Euclidean	Correlation / R²
Need to set K?	YES	NO (auto-determined)
Output	Flat clusters	Hierarchical tree
Use case	Customer segmentation	Multicollinearity removal

Bottom line: K-Means doesn't even speak the same language as feature selection. It clusters points in space; variable clustering groups columns by correlation. Different problems, different tools.

Step 7: Top 2 Features Per Cluster

From each cluster, pick the top 2 features using RS_Ratio:

RS_Ratio = (1 - R²_own) / (1 - R²_nearest)

LOWER = BETTER:
  • Feature fits its own cluster strongly
  • Feature differs from other clusters

Why 2 features instead of 1? Backup. If one feature degrades on out-of-time data or breaks in production, the other still carries the cluster's signal.
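
A sketch of the selection step, assuming each cluster is summarized by the first principal component of its standardized features, R²_own is the squared correlation with the feature's own cluster component, and R²_nearest is the largest squared correlation with any other cluster's component. The helper names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def cluster_component(Z):
    """First principal component score of a block of standardized columns."""
    corr = np.corrcoef(Z, rowvar=False) if Z.shape[1] > 1 else np.ones((1, 1))
    _, eigenvectors = np.linalg.eigh(corr)
    return Z @ eigenvectors[:, -1]  # eigh sorts ascending, so the last column is PC1

def rs_ratio_table(df, clusters):
    """Compute R²_own, R²_nearest and RS_Ratio for every clustered feature."""
    Z = (df - df.mean()) / df.std()
    comps = {cid: cluster_component(Z[cols].to_numpy()) for cid, cols in clusters.items()}
    rows = []
    for cid, cols in clusters.items():
        for col in cols:
            r2 = {k: np.corrcoef(Z[col], comp)[0, 1] ** 2 for k, comp in comps.items()}
            r2_own = r2.pop(cid)
            r2_nearest = max(r2.values()) if r2 else 0.0
            rows.append((cid, col, r2_own, r2_nearest, (1 - r2_own) / (1 - r2_nearest)))
    return pd.DataFrame(rows, columns=["cluster", "feature", "r2_own", "r2_nearest", "rs_ratio"])

rng = np.random.default_rng(2)
n = 2000
wealth, maturity = rng.normal(size=n), rng.normal(size=n)
df = pd.DataFrame({
    "Income": wealth + 0.2 * rng.normal(size=n),
    "Savings": wealth + 0.4 * rng.normal(size=n),
    "NetWorth": wealth + 0.8 * rng.normal(size=n),  # noisiest member of its cluster
    "Age": maturity + 0.3 * rng.normal(size=n),
    "Tenure": maturity + 0.3 * rng.normal(size=n),
})
clusters = {"wealth": ["Income", "Savings", "NetWorth"], "maturity": ["Age", "Tenure"]}

table = rs_ratio_table(df, clusters)
# Lower RS_Ratio is better, so sort ascending and keep two rows per cluster.
top2 = table.sort_values("rs_ratio").groupby("cluster").head(2)
print(top2)
```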

📊 The Filtering Funnel

┌──────────────────────────────────────────────────────────┐
│                                                          │
│   ~5,000 features   →  Random Forest Importance          │
│                            ↓                             │
│   ~500 features     →  Drop zero-importance              │
│                            ↓                             │
│                          Parallel IV / PSI / CSI         │
│                            ↓                             │
│   ~200 features     →  Stability + IV filters            │
│                            ↓                             │
│   ~50-100 features  →  Variable clustering               │
│                            ↓                             │
│   ~10-20 features   →  Top 2 per cluster ✅              │
│                                                          │
└──────────────────────────────────────────────────────────┘

💡 The Three Reasons This Pipeline Works

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  1. PREDICTIVE  → IV + Random Forest importance          │
│                   ensure features actually matter        │
│                                                          │
│  2. STABLE      → PSI + CSI across 2 OOT periods         │
│                   ensure features hold up over time      │
│                                                          │
│  3. INDEPENDENT → Variable clustering removes redundant  │
│                   features so each kept feature adds     │
│                   new info                               │
│                                                          │
│  Predictive + Stable + Independent =                     │
│      Production-ready feature set                        │
│                                                          │
└──────────────────────────────────────────────────────────┘

🎯 Key Takeaways

✅ Automated feature selection cuts thousands of features down to 10–20 in minutes
✅ Random Forest ranks raw importance
✅ IV, PSI, CSI filter for predictive power and stability
✅ Variable clustering removes multicollinearity via hierarchical PCA-based feature grouping
✅ Eigenvalues (λ₂ > 0.7) decide WHEN to split a cluster
✅ Loadings decide WHICH features join which sub-cluster
✅ Top 2 per cluster preserves signal while killing redundancy

Bigdata and data science by Kartheek Dachepalli

Wednesday, June 24, 2026