📚 The Data Scientist's Algorithm Bible
Your one-stop reference for picking the right ML algorithm, knowing the right metrics, and acing any interview — built to industry standards.
🎯 1. THE BIG PICTURE — VISUAL SUMMARY
┌──────────────────────────────────────────────────────────────────┐
│ │
│ QUESTION │ ANSWER │ USE THIS │
│ ──────────────────────┼───────────────┼───────────────── │
│ Yes/No prediction? │ Classify │ XGBoost ⭐ │
│ $ amount prediction? │ Regression │ XGBoost ⭐ │
│ Find groups? │ Cluster │ K-Means ⭐ │
│ Find weird points? │ Anomaly │ Isolation Forest ⭐ │
│ Reduce features? │ Dim. Reduce │ PCA ⭐ │
│ Visualize data? │ Dim. Reduce │ t-SNE / UMAP │
│ Image classification? │ Deep Learn │ CNN ⭐ │
│ Text classification? │ Deep Learn │ Transformers (BERT) ⭐│
│ Recommend items? │ Reco System │ Matrix Factorization │
│ Sequence prediction? │ Time Series │ LSTM / Prophet / ARIMA│
│ Generate text/image? │ Generative │ LLMs / GANs / Diffusion│
│ Reusable embeddings? │ Deep Learn │ MLP / Autoencoder │
│ │
└──────────────────────────────────────────────────────────────────┘
🎓 2. ALGORITHM SELECTION CHEAT SHEET
┌──────────────────────────────────────────────────────────────────┐
│ │
│ WHICH ALGORITHM TO PICK? │
│ │
│ Do you have LABELS? │
│ │ │
│ ┌───┴───┐ │
│ │ │ │
│ YES NO │
│ │ │ │
│ │ ├─→ Want to GROUP data? │
│ │ │ └─→ K-Means / DBSCAN / GMM / Hierarchical │
│ │ │ │
│ │ ├─→ Want to FIND OUTLIERS? │
│ │ │ └─→ Isolation Forest / LOF / One-Class SVM │
│ │ │ │
│ │ ├─→ Want to REDUCE FEATURES? │
│ │ │ └─→ PCA / t-SNE / UMAP / Autoencoder │
│ │ │ │
│ │ └─→ Want EMBEDDINGS? │
│ │ └─→ Autoencoder / Word2Vec / MLP │
│ │ │
│ ├─→ Target is CONTINUOUS ($)? │
│ │ ├─→ Simple, interpretable: Linear / Ridge / Lasso │
│ │ ├─→ Best general: XGBoost / LightGBM ⭐ │
│ │ ├─→ Many features: Random Forest │
│ │ └─→ Deep patterns: Neural Network │
│ │ │
│ └─→ Target is CATEGORICAL (Yes/No)? │
│ ├─→ Simple, interpretable: Logistic Regression │
│ ├─→ Best general: XGBoost / LightGBM ⭐ │
│ ├─→ Image data: CNN ⭐ (ResNet, EfficientNet) │
│ ├─→ Text data: Transformers (BERT, RoBERTa) ⭐ │
│ ├─→ Audio data: CNN / Wav2Vec │
│ └─→ Time series: LSTM / Transformer │
│ │
└──────────────────────────────────────────────────────────────────┘
🏆 3. METRICS BY MODEL TYPE
┌──────────────────────────────────────────────────────────────────┐
│ │
│ MODEL TYPE │ METRICS │ WHY? │
│ ───────────────────┼───────────────────┼───────────── │
│ Binary Classif. │ AUC, F1, KS │ Imbalance-aware │
│ Multi-class │ F1, Accuracy │ Per-class fairness │
│ Regression │ R², MAE, RMSE │ Error in real units │
│ Clustering │ Silhouette │ No labels needed │
│ Anomaly Detection │ Precision@K │ Top-K matters │
│ Dim. Reduction │ Explained Var. │ Info retained │
│ Recommendation │ NDCG, MAP │ Ranking quality │
│ Time Series │ MAPE, MAE │ Scale-free (MAPE), real units (MAE) │
│ Image Classif. │ Top-1/Top-5 Acc │ CNN standard │
│ Object Detection │ mAP @ IoU │ Box overlap accuracy │
│ NLP Generation │ BLEU, ROUGE │ Text overlap │
│ │
└──────────────────────────────────────────────────────────────────┘
🎯 4. INDUSTRY-ACCEPTED THRESHOLDS
┌──────────────────────────────────────────────────────────────────┐
│ │
│ METRIC │ ❌ Bad │ ⚠️ OK │ ✅ Good │ 🏆 Great │
│ ────────────────┼─────────┼──────────┼──────────┼────────── │
│ AUC │ < 0.6 │ 0.6-0.7 │ 0.7-0.85│ > 0.85 │
│ F1 │ < 0.5 │ 0.5-0.7 │ 0.7-0.85│ > 0.85 │
│ KS (×100) │ < 20 │ 20-40 │ 40-60 │ > 60 │
│ R² │ < 0.3 │ 0.3-0.5 │ 0.5-0.8 │ > 0.8 │
│ Silhouette │ < 0.15 │ 0.15-0.3│ 0.3-0.5 │ > 0.5 │
│ Precision@K │ < 5% │ 5-15% │ 15-30% │ > 30% │
│ Top-1 Acc (img) │ < 60% │ 60-75% │ 75-90% │ > 90% │
│ mAP @ IoU=0.5 │ < 0.3 │ 0.3-0.5 │ 0.5-0.7 │ > 0.7 │
│ PSI / CSI │ > 0.25 │ 0.10-0.25│ < 0.10 │ < 0.05 │
│ IV (univariate) │ < 0.02 │ 0.02-0.1 │ 0.1-0.3 │ > 0.3 │
│ VIF │ > 10 │ 5-10 │ 2-5 │ < 2 │
│ │
│ Drift Alert: PSI > 0.25 → RETRAIN MODEL 🚨 │
│ │
└──────────────────────────────────────────────────────────────────┘
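To make the PSI row and the drift alert concrete, here is a minimal sketch of the usual PSI formula, sum over bins of (actual − expected) · ln(actual / expected). The bin proportions, the `eps` guard for empty bins, and the `drift_status` helper are illustrative assumptions, not values from a real model.

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both inputs are lists of bin proportions over the SAME bins.
    """
    if len(expected_pct) != len(actual_pct):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e = max(e, eps)  # assumption: tiny floor to avoid log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def drift_status(value):
    """Map a PSI value onto the bands used in the table above."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "monitor"
    return "retrain"

# Hypothetical score distributions over five bins (training vs production).
train_bins = [0.10, 0.20, 0.40, 0.20, 0.10]
prod_bins = [0.05, 0.10, 0.30, 0.30, 0.25]

score = psi(train_bins, prod_bins)
print(f"PSI = {score:.3f} -> {drift_status(score)}")
```

With these made-up bins the PSI lands at roughly 0.31, which is above the 0.25 alert line. In practice the bin edges are fixed on the training data and reused in production.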
📊 5. DATA PREPROCESSING BIBLE
┌──────────────────────────────────────────────────────────────────┐
│ │
│ STEP │ WHEN │ TECHNIQUE │
│ ──────────────────┼────────────────────────┼──────────── │
│ Missing Values │ NaNs in data │ Median / KNN │
│ Scaling │ Linear / KNN / NN │ StandardScaler │
│ │ Tree models │ Not needed │
│ Encoding │ Low cardinality │ OneHot │
│ │ High cardinality │ Target Encoding │
│ Outliers │ Linear sensitive │ IQR / Z-score │
│ Class Imbalance │ Rare positive class │ SMOTE / Class wts │
│ Train/Test Split │ Random data │ 80/20 random │
│ │ Time series │ Temporal split │
│ Cross Validation │ Tuning hyperparams │ K-Fold (5) │
│ │ Time series │ TimeSeriesCV │
│ │
└──────────────────────────────────────────────────────────────────┘
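The missing-value and scaling rows share one rule that is easy to get wrong: learn the statistics on the training split only, then reuse them on the test split. The sketch below shows this for a single numeric column using only the standard library. Representing missing values as `None` and the helper names `fit_preprocessor` / `transform` are assumptions made for illustration.

```python
import statistics

def fit_preprocessor(train_column):
    """Learn the median (for imputation) and mean/std (for scaling) from TRAINING data only."""
    observed = [x for x in train_column if x is not None]
    median = statistics.median(observed)
    filled = [median if x is None else x for x in train_column]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population std, the same convention StandardScaler uses
    return {"median": median, "mean": mean, "std": std or 1.0}

def transform(column, params):
    """Apply the stored training statistics; nothing is re-estimated here."""
    filled = [params["median"] if x is None else x for x in column]
    return [(x - params["mean"]) / params["std"] for x in filled]

# Hypothetical feature with one missing value and one outlier.
train = [10.0, 12.0, None, 14.0, 100.0]
test = [None, 13.0]

params = fit_preprocessor(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # uses the TRAIN median, mean and std

print(params["median"], [round(v, 3) for v in test_scaled])
```

The median ignores the outlier at 100, which is why the table suggests it over the mean for imputation. Fitting on the full dataset before splitting would leak test information into training.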
🔬 6. FEATURE SELECTION & ELIMINATION (Critical Step!)
┌──────────────────────────────────────────────────────────────────┐
│ │
│ TECHNIQUE │ PURPOSE │ THRESHOLD │
│ ──────────────────┼───────────────────────┼───────────── │
│ IV (Info Value) │ Predictive strength │ Keep IV > 0.10 │
│ │ for binary target │ Drop IV < 0.02 │
│ WoE Transform │ Bin-level signal │ Used with IV │
│ VIF │ Detect redundancy │ Drop VIF > 10 │
│ │ (multicollinearity) │ Keep VIF < 5 │
│ Correlation │ Pairwise overlap │ Drop if |r| > 0.85 │
│ Mutual Info │ Non-linear signal │ Keep top-K │
│ Permutation Imp. │ Drop in accuracy │ Universal method │
│ │ when feature shuffled│ │
│ SHAP Importance │ Feature contribution │ Industry default ⭐│
│ Recursive (RFE) │ Backward elimination │ For small datasets │
│ │
│ TYPICAL ORDER: │
│ 1. IV (univariate signal) → drop weak features │
│ 2. VIF (redundancy) → drop correlated features │
│ 3. SHAP (final ranking) → keep top contributors │
│ │
└──────────────────────────────────────────────────────────────────┘
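Step 1 of the typical order can be sketched as follows, assuming the feature has already been binned and you have good/bad counts per bin. The counts, the `woe_iv` helper, and the `eps` smoothing for empty cells are illustrative assumptions. Sign conventions for WoE vary between teams (good/bad vs bad/good), but IV comes out the same either way.

```python
import math

def woe_iv(bins, eps=0.5):
    """Compute Weight of Evidence per bin and the total Information Value.

    bins: list of (n_good, n_bad) counts, one tuple per bin.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        # assumption: replace empty cells with a small count so the log is defined
        pct_good = max(good, eps) / total_good
        pct_bad = max(bad, eps) / total_bad
        woe = math.log(pct_good / pct_bad)
        woes.append(woe)
        iv += (pct_good - pct_bad) * woe
    return woes, iv

# Hypothetical feature split into three bins: (goods, bads) in each bin.
bins = [(400, 20), (300, 30), (200, 50)]
woes, iv = woe_iv(bins)

print([round(w, 3) for w in woes], round(iv, 3))
```

For these made-up counts the IV is about 0.42, above the "keep" line of 0.10 in the table. Very high IV values are also worth a second look, because they can be a sign of leakage.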
⚙️ 7. HYPERPARAMETER TUNING GUIDE
┌──────────────────────────────────────────────────────────────────┐
│ │
│ ALGORITHM │ KEY HYPERPARAMETERS │
│ ──────────────────┼────────────────────────────── │
│ XGBoost │ n_estimators, learning_rate, │
│ │ max_depth, subsample, colsample_bytree │
│ Random Forest │ n_estimators, max_depth │
│ Logistic Reg. │ C (regularization), penalty │
│ Neural Network │ layers, units, dropout, learning_rate │
│ CNN │ filters, kernel_size, augmentations │
│ K-Means │ n_clusters (K), max_iter │
│ DBSCAN │ eps, min_samples │
│ Isolation Forest │ n_estimators, contamination │
│ PCA │ n_components │
│ │
│ ⭐ INDUSTRY DEFAULT TUNING TOOL: Hyperopt / Optuna │
│ (Bayesian optimization — fast + best quality) │
│ │
└──────────────────────────────────────────────────────────────────┘
⚖️ 8. BIAS, FAIRNESS & EXPLAINABILITY
┌──────────────────────────────────────────────────────────────────┐
│ │
│ CONCERN │ WHAT TO CHECK │ TOOL │
│ ───────────────────┼───────────────────────────┼──────── │
│ Bias │ Performance per subgroup │ Group AUC │
│ Fairness │ Equal selection rates │ Demo. Parity │
│ Explainability │ Why this prediction? │ SHAP ⭐ │
│ Data Drift │ Input distribution shift │ PSI / CSI │
│ Concept Drift │ Target-feature shift │ AUC/KS over time │
│ │
│ GOLDEN RULES: │
│ ✅ Never use protected attributes (race, gender) directly │
│ ✅ Watch for PROXY variables (zip code → race) │
│ ✅ Audit across subgroups, not just overall │
│ ✅ Monitor production drift weekly/monthly │
│ │
└──────────────────────────────────────────────────────────────────┘
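The demographic parity check in the table boils down to comparing selection rates across groups. Here is a minimal sketch, assuming binary predictions (1 = selected) and a group label per row. The data and the helper names are hypothetical, and what counts as an acceptable gap is a policy decision this code does not make.

```python
from collections import defaultdict

def selection_rates(groups, predictions):
    """Share of positive predictions within each subgroup."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_selected, n_total]
    for group, pred in zip(groups, predictions):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {group: sel / tot for group, (sel, tot) in counts.items()}

def demographic_parity_gap(rates):
    """Largest difference in selection rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical audit sample: group label and model decision per applicant.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
preds = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, preds)
gap = demographic_parity_gap(rates)
print(rates, gap)
```

The same grouping pattern works for the "performance per subgroup" row: swap the selection rate for a per-group metric such as AUC.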
📈 9. CLASSIFICATION METRICS
┌──────────────────────────────────────────────────────────────────┐
│ │
│ METRIC │ FORMULA │ USE WHEN │
│ ────────────┼───────────────────────┼────────── │
│ Accuracy │ (TP+TN)/Total │ Balanced data │
│ Precision │ TP/(TP+FP) │ False positives costly │
│ Recall │ TP/(TP+FN) │ False negatives costly │
│ F1 │ 2·P·R/(P+R) │ Imbalanced data │
│ AUC-ROC │ Area under ROC │ Ranking quality ⭐ │
│ AUC-PR │ Area under P-R │ Severe imbalance │
│ Log Loss │ -mean[y·log p + (1-y)·log(1-p)] │ Probabilistic models │
│ KS │ max(TPR - FPR) │ Credit risk │
│ │
│ CONFUSION MATRIX: │
│ │
│ │ Predicted YES │ Predicted NO │
│ ─────────────────┼──────────────────┼────────────── │
│ Actual YES │ TP (✅) │ FN (😭 missed) │
│ Actual NO │ FP (😅 wrong) │ TN (✅) │
│ │
└──────────────────────────────────────────────────────────────────┘
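The formulas above can be checked by hand on a tiny imbalanced example. This is a plain-Python sketch with made-up labels and scores. The 0.5 cutoff and the function names are assumptions, and the KS loop is the simple quadratic version, fine for illustration but slow on large data.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels encoded as 1 (YES) and 0 (NO)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def ks_statistic(y_true, scores):
    """max(TPR - FPR) over every score used as a threshold (0-1 scale)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for t in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t) / n_pos
        fpr = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t) / n_neg
        best = max(best, tpr - fpr)
    return best

# Hypothetical data: 2 positives out of 10 rows.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

metrics = classification_metrics(y_true, y_pred)
ks = ks_statistic(y_true, scores)
print(metrics, ks)
```

Accuracy comes out at 0.8 while precision, recall and F1 are all 0.5, which is the "accuracy on imbalanced data" pitfall in miniature. Multiply the KS value by 100 to compare it with the usual 0-100 threshold bands.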
📈 10. REGRESSION METRICS
┌──────────────────────────────────────────────────────────────────┐
│ │
│ METRIC │ WHAT IT MEASURES │ USE WHEN │
│ ────────────┼──────────────────────────┼────────── │
│ MAE │ Avg absolute error │ Interpretable │
│ │ Same unit as target │ Outliers present │
│ MSE │ Avg squared error │ Used internally │
│ RMSE │ √MSE — same unit │ DEFAULT ⭐ │
│ │ Punishes big errors │ │
│ R² │ Variance explained │ Business explanation │
│ │ 1=perfect, 0=predicts mean │ │
│ MAPE │ Avg % error │ Forecasting │
│ │ Bad when target ≈ 0 │ │
│ │
│ WHICH TO USE: │
│ • Default → RMSE │
│ • Interpretable → MAE │
│ • Business explanation → R² │
│ • Forecasting → MAPE │
│ │
└──────────────────────────────────────────────────────────────────┘
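A quick way to get a feel for these metrics is to compute all of them on the same predictions. The sketch below uses four invented dollar amounts. `regression_metrics` is a hypothetical helper, and MAPE here assumes no true value is zero.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # spread around the mean
    r2 = 1 - ss_res / ss_tot
    # MAPE divides by the true value, so it breaks when a target is 0.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}

# Hypothetical targets and predictions in dollars; the last one misses by 50.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 350.0]

results = regression_metrics(y_true, y_pred)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

MAE is 20 while RMSE is about 26.5: the single 50-dollar miss pulls RMSE up, which is what "punishes big errors" means in the table.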
🔍 11. CLUSTERING EVALUATION
┌──────────────────────────────────────────────────────────────────┐
│ │
│ METRIC │ ONE-LINER │
│ ───────────────┼──────────────────── │
│ Silhouette │ How well a point fits its cluster vs others │
│ │ Range: -1 to +1, higher = better │
│ │
│ Inertia │ Sum of squared distances from points to │
│ │ their centroids (the RAW NUMBER) │
│ │ Lower = tighter clusters │
│ │
│ Elbow Method │ The TECHNIQUE of plotting Inertia for │
│ │ multiple K values and picking the "bend" │
│ │
└──────────────────────────────────────────────────────────────────┘
💡 Inertia vs Elbow Method — The Key Difference
┌──────────────────────────────────────────────────────────────────┐
│ │
│ INERTIA = A SINGLE NUMBER │
│ (one value for one K — e.g., K=5 → inertia=12,345) │
│ │
│ ELBOW METHOD = A TECHNIQUE that USES INERTIA │
│ (plot inertia for K=2,3,4...15 → find the bend) │
│ │
│ Analogy: │
│ Inertia = a thermometer reading (one number) │
│ Elbow Method = the technique of watching readings over time │
│ │
│ USAGE: │
│ 1. Calculate inertia for K=2 to K=15 │
│ 2. Plot inertia vs K (this PLOT = Elbow Method) │
│ 3. Pick K at the "elbow bend" (diminishing returns point) │
│ │
└──────────────────────────────────────────────────────────────────┘
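The usage steps above can be sketched without any library by running a small Lloyd's-style K-Means and recording the inertia for each K. Everything here is an assumption for illustration: the toy points (three obvious blobs), the deterministic farthest-first initialisation, and the helper names. A real project would more likely read the inertia from a library implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Farthest-first initialisation: deterministic and spreads centroids out."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final inertia (one number per K)."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stopped changing
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical 2-D data: three tight, well-separated groups of four points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(f"K={k}: inertia={value:.2f}")
```

On this toy data the inertia collapses when K goes from 2 to 3 and barely moves afterwards, so the bend sits at K=3. Plotting `inertias` against K gives the elbow chart. On real data the bend is rarely this sharp, which is why Silhouette is often checked alongside it.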
🎨 Visual Example
Inertia
│● ← K=2, very high inertia
│ ●
│ ●
│ ●● ← Big drops (each K adds real value)
│ ●●
│ ●●●●● ← ELBOW! (K=5 or 6 optimal)
│ ●●●●●● ← Tiny drops (diminishing returns)
│ ●●●●●
└───────────────────────── K
2 3 4 5 6 7 8 9 10
🎯 12. THE PRO'S QUICK FACTS
┌──────────────────────────────────────────────────────────────────┐
│ │
│ 💡 KEY INSIGHTS: │
│ │
│ ✅ XGBoost = strong default for tabular data (most business problems) │
│ ✅ CNN = default for image classification (ResNet, EfficientNet)│
│ ✅ Transformers (BERT) = default for NLP │
│ ✅ LSTM/Prophet = default for time series │
│ ✅ K-Means = default clustering when you know K │
│ ✅ DBSCAN = clustering when you don't know K │
│ ✅ Isolation Forest = default for anomaly detection │
│ ✅ PCA = default for dimensionality reduction │
│ ✅ LightGBM = faster XGBoost for big data │
│ ✅ Hyperopt / Optuna = default for hyperparameter tuning │
│ ✅ MLflow = default for experiment tracking │
│ ✅ SHAP = default for explainability │
│ ✅ PSI / CSI = default for production drift monitoring │
│ ✅ IV + VIF = default for feature selection in credit risk │
│ ✅ Spark/Databricks = default for big data ML │
│ │
│ ⚠️ COMMON PITFALLS: │
│ │
│ ❌ Linear models without scaling │
│ ❌ K-Means without standardization │
│ ❌ Ignoring class imbalance in classification │
│ ❌ Using accuracy on imbalanced data │
│ ❌ Not validating on out-of-time data │
│ ❌ Forgetting to check for data leakage │
│ ❌ Trusting feature importance from correlated features │
│ ❌ Deploying without monitoring (PSI, drift) │
│ ❌ Skipping feature selection (IV/VIF) in regulated domains │
│ ❌ No baseline model before going complex │
│ │
└──────────────────────────────────────────────────────────────────┘
💡 The Three Rules to Live By
┌──────────────────────────────────────────────────────────────────┐
│ │
│ RULE 1: Start simple. Beat baseline before going complex. │
│ │
│ RULE 2: Trust evaluation, not algorithm hype. │
│ A simple model with good evaluation beats a fancy │
│ model with poor validation. │
│ │
│ RULE 3: Production starts at modeling, not after. │
│ Think monitoring, drift, fairness from day 1. │
│ │
└──────────────────────────────────────────────────────────────────┘