🚀 Automated Feature Selection at Scale: A Practical Blueprint
A deep dive into how modern AutoML pipelines turn thousands of raw features into 10–20 of the strongest — with the inner workings of variable clustering finally explained clearly.
🎯 The Problem
When you start a real ML project, you often have thousands of features waiting in feature stores.
Picking the right 10–20 manually means:
- Hours of EDA per feature
- Subjective domain bias
- Missed gold features hiding in low-importance columns
- Multicollinearity that quietly tanks your model
An automated feature selection pipeline solves this — turning 5,000+ features into a curated, stable, non-redundant set ready for XGBoost.
🏗️ The 7-Step Pipeline
📋 Step-by-Step Walkthrough
Step 1: Build the Combined Dataset
Sample ~11% of the data from the training set plus 2 Out-of-Time (OOT) snapshots, then union the samples into one dataset.
Why? Forces features to be stable across multiple time periods — not just lucky on training data.
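A minimal sketch of this step, assuming each snapshot is a NumPy array of rows and that the ~11% sample is drawn independently from each snapshot (the post does not say whether sampling is per snapshot or stratified). The name `sample_and_union` and the `frac` and `seed` parameters are illustrative, not part of any library.

```python
import numpy as np

def sample_and_union(snapshots, frac=0.11, seed=0):
    """Draw a random `frac` of rows from every snapshot and stack the samples."""
    rng = np.random.default_rng(seed)
    parts = []
    for rows in snapshots:
        n_keep = max(1, int(round(frac * len(rows))))
        # Sample row indices without replacement so no row is duplicated.
        keep = rng.choice(len(rows), size=n_keep, replace=False)
        parts.append(rows[keep])
    return np.vstack(parts)

# Toy stand-ins for train, OOT 1 and OOT 2 (3 feature columns each).
rng = np.random.default_rng(42)
train, oot1, oot2 = (rng.normal(size=(n, 3)) for n in (1000, 500, 300))
combined = sample_and_union([train, oot1, oot2])
print(combined.shape)
```

Because every snapshot contributes rows, a feature that only works in the training window gets penalized in every later step that is computed on `combined`.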
Step 2 & 3: Two Passes of Random Forest Importance
Build a Random Forest (~10,000 trees) and capture per-feature importance.
Why two passes? Bootstrap sampling and random feature subsets make a Forest's importance ranks noisy from run to run. A second pass filters out features whose high rank was a fluke.
Step 4: Univariate + Bivariate Stats (Parallel)
For each surviving feature, compute these metrics in parallel batches:
| Metric | One-Liner |
|---|---|
| IV (Information Value) | How much this feature predicts the target |
| PSI | Has this feature's distribution drifted over time? |
| CSI | Has this feature's relationship to target drifted? |
| WoE | Per-bucket signal — reused later for clustering |
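As a rough illustration of the WoE and IV rows above, here is a sketch for one numeric feature and a binary target. It assumes quantile bucketing, the `ln(% non-events / % events)` sign convention (some teams flip it), and additive smoothing for empty buckets. The function name `woe_iv` and its parameters are hypothetical.

```python
import numpy as np

def woe_iv(x, y, n_bins=5, smooth=0.5):
    """Return (per-bucket WoE, total IV) for numeric feature x and 0/1 target y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Interior quantile cut points; np.unique drops duplicates caused by ties.
    cuts = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bucket = np.digitize(x, cuts, right=True)  # bucket index in 0..len(cuts)
    n_buckets = len(cuts) + 1
    events = np.bincount(bucket, weights=y, minlength=n_buckets)
    non_events = np.bincount(bucket, weights=1 - y, minlength=n_buckets)
    # Additive smoothing keeps empty buckets from producing log(0).
    p_event = (events + smooth) / (events.sum() + smooth * n_buckets)
    p_non = (non_events + smooth) / (non_events.sum() + smooth * n_buckets)
    woe = np.log(p_non / p_event)
    iv = float(np.sum((p_non - p_event) * woe))
    return woe, iv

# Toy data: one feature that drives the target, one that is pure noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=5000)
target = (signal + rng.normal(size=5000) > 0).astype(int)
noise = rng.normal(size=5000)
woe_signal, iv_signal = woe_iv(signal, target)
woe_noise, iv_noise = woe_iv(noise, target)
print(round(iv_signal, 3), round(iv_noise, 3))
```

Each term `(p_non - p_event) * ln(p_non / p_event)` is non-negative, so IV can only grow as buckets separate events from non-events.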
Step 5: Hard Filters
Apply strict cutoffs:
| Filter | Why |
|---|---|
| PSI > 0 | Avoid dead/constant features |
| PSI < 0.1 | Distribution stable across time |
| CSI < 0.1 | Target relationship stable |
| IV > 0 | Has predictive signal |
| IV < 5.0 | Not target leakage |
Note: Using 0.1 (stricter) instead of the industry "alert" threshold of 0.25 is intentional — we're selecting features for a brand-new model, not just monitoring an existing one.
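The cutoffs above can be sketched as follows. The PSI function assumes the common formulation, `sum((actual% - expected%) * ln(actual% / expected%))`, over quantile buckets built on the baseline sample. The names `psi` and `passes_hard_filters` are illustrative, and the CSI value is taken as a precomputed input because the post does not spell out its formula.

```python
import numpy as np

def psi(expected, actual, n_bins=10, floor=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Buckets come from the baseline's quantiles only.
    cuts = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1]))
    n_buckets = len(cuts) + 1
    e = np.bincount(np.digitize(expected, cuts, right=True), minlength=n_buckets) / len(expected)
    a = np.bincount(np.digitize(actual, cuts, right=True), minlength=n_buckets) / len(actual)
    # Floor the shares so an empty bucket does not produce log(0).
    e, a = np.clip(e, floor, None), np.clip(a, floor, None)
    return float(np.sum((a - e) * np.log(a / e)))

def passes_hard_filters(iv, psi_value, csi_value):
    """Apply the Step 5 cutoffs from the table above."""
    return bool((0 < psi_value < 0.1) and (csi_value < 0.1) and (0 < iv < 5.0))

rng = np.random.default_rng(0)
baseline = rng.normal(size=5000)
psi_stable = psi(baseline, rng.normal(size=5000))           # same distribution
psi_shifted = psi(baseline, rng.normal(loc=1.0, size=5000))  # mean moved by 1 sd
print(round(psi_stable, 4), round(psi_shifted, 4))
```

A constant column lands in a single bucket in both samples, so its PSI is exactly 0 and the `PSI > 0` rule drops it.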
Step 6: Variable Clustering — The Star of the Show 🌳
Even after Steps 1-5, your surviving features still have a hidden problem: multicollinearity.
Many features are correlated and carry redundant signal. We need to group them and pick the best from each group.
The industry-standard tool here is hierarchical variable clustering (PROC VARCLUS in SAS; VarClusHi is a Python implementation of the same idea).
🧠 Deep Dive: How Variable Clustering Actually Works
This algorithm clusters features (columns) — not data points (rows). This is the opposite of K-Means, which clusters rows.
The Mental Flip
🎯 The Two PCA Ingredients Variable Clustering Uses
Before splitting a cluster, the algorithm runs PCA on it. PCA gives two things:
1️⃣ Eigenvalues — "How LOUD is each signal?"
Imagine running PCA on a cluster of 5 features. You get something like:
Eigenvalue = volume knob. High λ means an important signal. Low λ means noise.
Visual:
2️⃣ Loadings — "Which feature contributes to which signal?"
Loadings form a matrix that shows how strongly each feature aligns with each PC:
Loading = magnetism. Bigger absolute loading = that feature fits that signal better.
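To make the two ingredients concrete, here is a small sketch that runs PCA on the correlation matrix of a toy 5-feature cluster. It assumes loadings are defined as eigenvector times the square root of the eigenvalue, which makes each loading the correlation between a feature and a PC. The function name `pca_on_cluster` and the toy data are illustrative.

```python
import numpy as np

def pca_on_cluster(X):
    """Return (eigenvalues, loadings) of the correlation matrix, strongest PC first."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each eigenvector by sqrt(eigenvalue) to get feature-to-PC correlations.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return eigvals, loadings

# Toy cluster: three features follow one hidden signal, two follow another.
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=(2, 2000))
X = np.column_stack([s1 + 0.3 * rng.normal(size=2000) for _ in range(3)] +
                    [s2 + 0.3 * rng.normal(size=2000) for _ in range(2)])
eigvals, loadings = pca_on_cluster(X)
print(np.round(eigvals, 2))
print(np.round(loadings[:, :2], 2))
```

The eigenvalues always sum to the number of features, so a second eigenvalue well above its "fair share" signals that the cluster holds more than one signal.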
🌳 The Algorithm
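The recursion can be sketched roughly as below. This is a simplified illustration, not the SAS or VarClusHi implementation: it splits a cluster whenever the second eigenvalue exceeds 0.7, rotates the first two loading columns with a grid-searched quartimax-style rotation (an assumption made here for simplicity), and skips the iterative reassignment that production implementations perform. All function names are hypothetical.

```python
import numpy as np

def top_two_loadings(X):
    """Eigenvalues (descending) and loadings on the first two PCs of corr(X)."""
    vals, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    return vals, vecs[:, :2] * np.sqrt(np.clip(vals[:2], 0, None))

def quartimax_rotate(loadings, steps=180):
    """Grid-search the rotation angle that maximizes the sum of loadings**4."""
    best, best_score = loadings, -np.inf
    for theta in np.linspace(0, np.pi / 2, steps, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rotated = loadings @ np.array([[c, -s], [s, c]])
        score = np.sum(rotated ** 4)
        if score > best_score:
            best, best_score = rotated, score
    return best

def var_clus(X, names, max_eigval2=0.7):
    """Recursively split columns of X until every cluster has eigenvalue 2 <= threshold."""
    if len(names) < 2:
        return [list(names)]
    vals, loadings = top_two_loadings(X)
    if vals[1] <= max_eigval2:          # one dominant signal: stop splitting
        return [list(names)]
    rotated = quartimax_rotate(loadings)
    # Each feature joins the rotated component it loads on most strongly.
    side = np.abs(rotated[:, 1]) > np.abs(rotated[:, 0])
    if side.all() or not side.any():    # degenerate split: keep the cluster whole
        return [list(names)]
    clusters = []
    for mask in (~side, side):
        sub_names = [n for n, m in zip(names, mask) if m]
        clusters += var_clus(X[:, mask], sub_names, max_eigval2)
    return clusters

# Toy data: a1..a3 share one hidden signal, b1..b2 share another.
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 2000))
names = ["a1", "a2", "a3", "b1", "b2"]
X = np.column_stack([f1 + 0.3 * rng.normal(size=2000) for _ in range(3)] +
                    [f2 + 0.3 * rng.normal(size=2000) for _ in range(2)])
clusters = var_clus(X, names)
print(clusters)
```

The number of clusters is never passed in: it falls out of how many times the eigenvalue test triggers a split.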
🎨 Worked Example
Start: 5 features in one cluster
Run PCA on this cluster:
Use loadings to assign features to sub-clusters:
Result after one split:
Recursively check each new cluster:
Final hierarchy:
🎯 Why Variable Clustering Over K-Means?
| Aspect | K-Means | Variable Clustering |
|---|---|---|
| Clusters | Rows (customers) | Columns (features) |
| Distance metric | Euclidean | Correlation / R² |
| Need to set K? | YES | NO (auto-determined) |
| Output | Flat clusters | Hierarchical tree |
| Use case | Customer segmentation | Multicollinearity removal |
Bottom line: K-Means doesn't even speak the same language as feature selection. It clusters points in space; variable clustering groups columns by correlation. Different problems, different tools.
Step 7: Top 2 Features Per Cluster
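From each cluster, pick the top 2 features by RS_Ratio, the 1 − R² ratio: (1 − R² with own cluster) / (1 − R² with nearest other cluster). Lower is better, because it means a feature fits its own cluster tightly and overlaps little with the others.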
Why 2 features instead of 1? Backup. If one fails in production OOT, the other carries the cluster's signal.
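A sketch of the selection, assuming the usual definition `RS_Ratio = (1 - R²_own) / (1 - R²_nearest)`, where each cluster is summarized by its first principal component and R² is the squared correlation between a feature and that component. The helper names and the toy cluster assignment are illustrative. The code does not guard against `R²_nearest` being exactly 1.

```python
import numpy as np

def cluster_component(X):
    """First principal component score of the standardized columns of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    if Z.shape[1] == 1:
        return Z[:, 0]
    vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    return Z @ vecs[:, -1]  # eigh sorts ascending, so the last vector is PC1

def rs_ratio_rows(X, names, clusters):
    """Return (cluster index, feature name, RS_Ratio) for every feature."""
    col = {name: i for i, name in enumerate(names)}
    comps = [cluster_component(X[:, [col[n] for n in c]]) for c in clusters]
    rows = []
    for k, members in enumerate(clusters):
        for name in members:
            r2 = [np.corrcoef(X[:, col[name]], comp)[0, 1] ** 2 for comp in comps]
            nearest = max((v for j, v in enumerate(r2) if j != k), default=0.0)
            rows.append((k, name, (1 - r2[k]) / (1 - nearest)))
    return rows

def top_per_cluster(rows, top_n=2):
    """Keep the `top_n` features with the lowest RS_Ratio in each cluster."""
    picked = {}
    for k, name, _ in sorted(rows, key=lambda r: r[2]):
        bucket = picked.setdefault(k, [])
        if len(bucket) < top_n:
            bucket.append(name)
    return picked

# Toy data: a3 is a much noisier copy of the signal behind a1 and a2.
rng = np.random.default_rng(2)
f1, f2 = rng.normal(size=(2, 2000))
names = ["a1", "a2", "a3", "b1", "b2"]
noise_sd = [0.1, 0.3, 1.0, 0.2, 0.5]
X = np.column_stack([f + sd * rng.normal(size=2000)
                     for f, sd in zip([f1, f1, f1, f2, f2], noise_sd)])
clusters = [["a1", "a2", "a3"], ["b1", "b2"]]
picked = top_per_cluster(rs_ratio_rows(X, names, clusters))
print(picked)
```

In this toy setup the noisy `a3` has the weakest fit to its own cluster, so it is the one left out of the first cluster's top 2.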
📊 The Filtering Funnel
💡 The Three Reasons This Pipeline Works
🎯 Key Takeaways
✅ Automated feature selection cuts thousands of features down to 20 in minutes
✅ Random Forest ranks raw importance
✅ IV, PSI, CSI filter for predictive power and stability
✅ Variable clustering removes multicollinearity via hierarchical PCA-based feature grouping
✅ Eigenvalues (λ₂ > 0.7) decide WHEN to split a cluster
✅ Loadings decide WHICH features join which sub-cluster
✅ Top 2 per cluster preserves signal while killing redundancy