Tuesday, June 23, 2026

📚 The Data Scientist's Algorithm Bible

 


Your one-stop reference for picking the right ML algorithm, knowing the right metrics, and acing any interview — built to industry standards.


🎯 1. THE BIG PICTURE — VISUAL SUMMARY

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│   QUESTION              │  ANSWER       │  USE THIS              │
│   ──────────────────────┼───────────────┼─────────────────       │
│   Yes/No prediction?    │  Classify     │  XGBoost ⭐            │
│   $ amount prediction?  │  Regression   │  XGBoost ⭐            │
│   Find groups?          │  Cluster      │  K-Means ⭐            │
│   Find weird points?    │  Anomaly      │  Isolation Forest ⭐   │
│   Reduce features?      │  Dim. Reduce  │  PCA ⭐                │
│   Visualize data?       │  Dim. Reduce  │  t-SNE / UMAP          │
│   Image classification? │  Deep Learn   │  CNN ⭐                │
│   Text classification?  │  Deep Learn   │  Transformers (BERT) ⭐│
│   Recommend items?      │  Reco System  │  Matrix Factorization  │
│   Sequence prediction?  │  Time Series  │  LSTM / Prophet / ARIMA│
│   Generate text/image?  │  Generative   │  LLMs / GANs / Diffusion│
│   Reusable embeddings?  │  Deep Learn   │  MLP / Autoencoder     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

🎓 2. ALGORITHM SELECTION CHEAT SHEET

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│                  WHICH ALGORITHM TO PICK?                        │
│                                                                  │
│   Do you have LABELS?                                            │
│        │                                                         │
│    ┌───┴───┐                                                     │
│    │       │                                                     │
│   YES     NO                                                     │
│    │       │                                                     │
│    │       ├─→ Want to GROUP data?                               │
│    │       │     └─→ K-Means / DBSCAN / GMM / Hierarchical       │
│    │       │                                                     │
│    │       ├─→ Want to FIND OUTLIERS?                            │
│    │       │     └─→ Isolation Forest / LOF / One-Class SVM      │
│    │       │                                                     │
│    │       ├─→ Want to REDUCE FEATURES?                          │
│    │       │     └─→ PCA / t-SNE / UMAP / Autoencoder            │
│    │       │                                                     │
│    │       └─→ Want EMBEDDINGS?                                  │
│    │             └─→ Autoencoder / Word2Vec / MLP                │
│    │                                                             │
│    ├─→ Target is CONTINUOUS ($)?                                 │
│    │     ├─→ Simple, interpretable: Linear / Ridge / Lasso       │
│    │     ├─→ Best general: XGBoost / LightGBM ⭐                 │
│    │     ├─→ Many features: Random Forest                        │
│    │     └─→ Deep patterns: Neural Network                       │
│    │                                                             │
│    └─→ Target is CATEGORICAL (Yes/No)?                           │
│          ├─→ Simple, interpretable: Logistic Regression          │
│          ├─→ Best general: XGBoost / LightGBM ⭐                 │
│          ├─→ Image data: CNN ⭐ (ResNet, EfficientNet)           │
│          ├─→ Text data: Transformers (BERT, RoBERTa) ⭐          │
│          ├─→ Audio data: CNN / Wav2Vec                           │
│          └─→ Time series: LSTM / Transformer                     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

🏆 3. METRICS BY MODEL TYPE

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  MODEL TYPE         │  METRICS          │  WHY?                  │
│  ───────────────────┼───────────────────┼─────────────           │
│  Binary Classif.    │  AUC, F1, KS      │  Imbalance-aware       │
│  Multi-class        │  Macro-F1, Acc.   │  Per-class fairness    │
│  Regression         │  R², MAE, RMSE    │  Error in real units   │
│  Clustering         │  Silhouette       │  No labels needed      │
│  Anomaly Detection  │  Precision@K      │  Top-K matters         │
│  Dim. Reduction     │  Explained Var.   │  Info retained         │
│  Recommendation     │  NDCG, MAP        │  Ranking quality       │
│  Time Series        │  MAPE, MAE        │  % and unit error      │
│  Image Classif.     │  Top-1/Top-5 Acc  │  CNN standard          │
│  Object Detection   │  mAP @ IoU        │  Box overlap accuracy  │
│  NLP Generation     │  BLEU, ROUGE      │  Text overlap          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

🎯 4. INDUSTRY-ACCEPTED THRESHOLDS

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  METRIC          │  ❌ Bad  │  ⚠️ OK    │  ✅ Good  │  🏆 Great   │
│  ────────────────┼─────────┼──────────┼──────────┼──────────    │
│  AUC             │  < 0.6  │  0.6-0.7 │  0.7-0.85│  > 0.85       │
│  F1              │  < 0.5  │  0.5-0.7 │  0.7-0.85│  > 0.85       │
│  KS              │  < 20   │  20-40   │  40-60   │  > 60         │
│  R²              │  < 0.3  │  0.3-0.5 │  0.5-0.8 │  > 0.8        │
│  Silhouette      │  < 0.15 │  0.15-0.3│  0.3-0.5 │  > 0.5        │
│  Precision@K     │  < 5%   │  5-15%   │  15-30%  │  > 30%        │
│  Top-1 Acc (img) │  < 60%  │  60-75%  │  75-90%  │  > 90%        │
│  mAP @ IoU=0.5   │  < 0.3  │  0.3-0.5 │  0.5-0.7 │  > 0.7        │
│  PSI / CSI       │  > 0.25 │ 0.10-0.25│  < 0.10  │  < 0.05       │
│  IV (univariate) │  < 0.02 │ 0.02-0.1 │  0.1-0.3 │  > 0.3        │
│  VIF             │  > 10   │  5-10    │  2-5     │  < 2          │
│                                                                  │
│  Drift Alert: PSI > 0.25 → RETRAIN MODEL 🚨                      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
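To make the PSI row concrete, here is a minimal sketch of the usual PSI formula: the sum over bins of (actual - expected) × ln(actual / expected). The bin shares and the `eps` floor are illustrative assumptions, not figures from a real model, and `drift_status` simply encodes the bands from the table above.

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Both arguments are lists of bin proportions (each summing to 1)
    built on the same bin edges. `eps` is an assumed floor that
    avoids log(0) when a bin is empty.
    """
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def drift_status(value):
    """Map a PSI value onto the bands used in the thresholds table."""
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "watch"
    return "retrain"

# Hypothetical score-band shares at training time vs. in production.
train_bins = [0.10, 0.20, 0.40, 0.20, 0.10]
prod_bins = [0.05, 0.15, 0.35, 0.25, 0.20]

score = psi(train_bins, prod_bins)
print(round(score, 4), drift_status(score))
```

The same function applied to one feature at a time gives CSI, since both indices share this formula and differ only in what is binned.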

📊 5. DATA PREPROCESSING BIBLE

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  STEP              │  WHEN                  │  TECHNIQUE          │
│  ──────────────────┼────────────────────────┼────────────         │
│  Missing Values    │  NaNs in data          │  Median / KNN       │
│  Scaling           │  Linear / KNN / NN     │  StandardScaler     │
│                    │  Tree models           │  Not needed         │
│  Encoding          │  Low cardinality       │  OneHot             │
│                    │  High cardinality      │  Target Encoding    │
│  Outliers          │  Linear sensitive      │  IQR / Z-score      │
│  Class Imbalance   │  Rare positive class   │  SMOTE / Class wts  │
│  Train/Test Split  │  Random data           │  80/20 random       │
│                    │  Time series           │  Temporal split     │
│  Cross Validation  │  Tuning hyperparams    │  K-Fold (5)         │
│                    │  Time series           │  TimeSeriesSplit    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
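The temporal split and time-series cross-validation rows share one rule: training rows must always come before test rows. The sketch below is a dependency-free illustration of that rule with an expanding training window. The function name and fold sizing are assumptions of this example. scikit-learn offers `TimeSeriesSplit` for the same purpose, and its fold sizes may differ from the ones produced here.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for time-ordered data.

    Rows are assumed to be sorted by time. Each fold trains on
    everything before its test window, so no future rows leak
    into training.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold
        # The last fold absorbs any leftover rows.
        test_end = n_samples if i == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(12, n_splits=3))
for train_idx, test_idx in splits:
    print("train up to", train_idx[-1], "-> test", test_idx)
```

A random 80/20 split on the same data would mix future rows into training, which is the leakage the temporal split is meant to prevent.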

🔬 6. FEATURE SELECTION & ELIMINATION (Critical Step!)

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  TECHNIQUE         │  PURPOSE              │  THRESHOLD          │
│  ──────────────────┼───────────────────────┼─────────────        │
│  IV (Info Value)   │  Predictive strength  │  Keep IV > 0.10     │
│                    │  for binary target    │  Drop IV < 0.02     │
│  WoE Transform     │  Bin-level signal     │  Used with IV       │
│  VIF               │  Detect redundancy    │  Drop VIF > 10      │
│                    │  (multicollinearity)  │  Keep VIF < 5       │
│  Correlation       │  Pairwise overlap     │  Drop if |r| > 0.85 │
│  Mutual Info       │  Non-linear signal    │  Keep top-K         │
│  Permutation Imp.  │  Drop in accuracy     │  Universal method   │
│                    │  when feature shuffled│                     │
│  SHAP Importance   │  Feature contribution │  Industry default ⭐│
│  Recursive (RFE)   │  Backward elimination │  For small datasets │
│                                                                  │
│  TYPICAL ORDER:                                                  │
│  1. IV (univariate signal) → drop weak features                  │
│  2. VIF (redundancy) → drop correlated features                  │
│  3. SHAP (final ranking) → keep top contributors                 │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
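Step 1 of the typical order can be sketched as follows. This is a minimal illustration of WoE and IV for one already-binned feature, using the common convention WoE = ln(% good / % bad). The bin counts, the `eps` smoothing constant, and the `iv_decision` helper are assumptions made for the example. The keep and drop cut-offs are the ones from the table.

```python
import math

def woe_iv(bins, eps=0.5):
    """Return (WoE per bin, total IV) for one binned feature.

    `bins` is a list of (good_count, bad_count) tuples, one per bin.
    `eps` is an assumed smoothing constant added to every count so
    empty cells do not cause division by zero or log(0).
    """
    total_good = sum(g + eps for g, _ in bins)
    total_bad = sum(b + eps for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        pct_good = (good + eps) / total_good
        pct_bad = (bad + eps) / total_bad
        woe = math.log(pct_good / pct_bad)
        woes.append(woe)
        iv += (pct_good - pct_bad) * woe
    return woes, iv

def iv_decision(iv):
    """Apply the table's cut-offs: keep above 0.10, drop below 0.02."""
    if iv > 0.10:
        return "keep"
    if iv < 0.02:
        return "drop"
    return "review"

# Hypothetical (good, bad) counts for three bins of one feature.
woes, iv = woe_iv([(400, 10), (300, 30), (200, 60)])
print([round(w, 3) for w in woes], round(iv, 3), iv_decision(iv))
```

Features that survive this filter would then go through the VIF and SHAP steps. This sketch does not cover those.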

⚙️ 7. HYPERPARAMETER TUNING GUIDE

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  ALGORITHM         │  KEY HYPERPARAMETERS                        │
│  ──────────────────┼──────────────────────────────               │
│  XGBoost           │  n_estimators, learning_rate,               │
│                    │  max_depth, subsample, colsample_bytree     │
│  Random Forest     │  n_estimators, max_depth                    │
│  Logistic Reg.     │  C (regularization), penalty                │
│  Neural Network    │  layers, units, dropout, learning_rate      │
│  CNN               │  filters, kernel_size, augmentations        │
│  K-Means           │  n_clusters (K), max_iter                   │
│  DBSCAN            │  eps, min_samples                           │
│  Isolation Forest  │  n_estimators, contamination                │
│  PCA               │  n_components                               │
│                                                                  │
│  ⭐ INDUSTRY DEFAULT TUNING TOOL: Hyperopt / Optuna               │
│     (Bayesian optimization — fast + best quality)                │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

⚖️ 8. BIAS, FAIRNESS & EXPLAINABILITY

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  CONCERN            │  WHAT TO CHECK            │  TOOL          │
│  ───────────────────┼───────────────────────────┼────────         │
│  Bias               │  Performance per subgroup │  Group AUC      │
│  Fairness           │  Equal selection rates    │  Demo. Parity   │
│  Explainability     │  Why this prediction?     │  SHAP ⭐         │
│  Data Drift         │  Input distribution shift │  PSI / CSI      │
│  Concept Drift      │  Target-feature shift     │  AUC/KS decay   │
│                                                                  │
│  GOLDEN RULES:                                                   │
│  ✅ Never use protected attributes (race, gender) directly       │
│  ✅ Watch for PROXY variables (zip code → race)                  │
│  ✅ Audit across subgroups, not just overall                     │
│  ✅ Monitor production drift weekly/monthly                      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
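The fairness row can be checked with a few lines of code. The sketch below computes the selection rate per subgroup and the demographic parity gap (largest rate minus smallest rate). The group labels and predictions are made-up illustrative data, and the function names are assumptions of this example.

```python
from collections import defaultdict

def selection_rates(groups, predictions):
    """Share of positive predictions (coded as 1) within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, pred in zip(groups, predictions):
        counts[group][0] += pred
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

def demographic_parity_gap(groups, predictions):
    """Difference between the highest and lowest group selection rate.

    A gap of 0 means every group is selected at the same rate.
    """
    rates = selection_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())

# Hypothetical audit sample: group membership and model decisions.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
preds = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, preds)
gap = demographic_parity_gap(groups, preds)
print(rates, gap)
```

How large a gap is acceptable depends on the domain and its regulation, so this sketch sets no threshold.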

📈 9. CLASSIFICATION METRICS

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  METRIC      │  FORMULA              │  USE WHEN                  │
│  ────────────┼───────────────────────┼──────────                  │
│  Accuracy    │  (TP+TN)/Total        │  Balanced data             │
│  Precision   │  TP/(TP+FP)           │  False positives costly    │
│  Recall      │  TP/(TP+FN)           │  False negatives costly    │
│  F1          │  2·P·R/(P+R)          │  Imbalanced data           │
│  AUC-ROC     │  Area under ROC       │  Ranking quality ⭐         │
│  AUC-PR      │  Area under P-R       │  Severe imbalance          │
│  Log Loss    │  -(1/N)·Σ y·log(p)    │  Probabilistic models      │
│  KS          │  max(TPR - FPR)       │  Credit risk               │
│                                                                  │
│  CONFUSION MATRIX:                                               │
│                                                                  │
│                    │  Predicted YES   │  Predicted NO            │
│   ─────────────────┼──────────────────┼──────────────            │
│   Actual YES       │  TP (✅)         │  FN (😭 missed)          │
│   Actual NO        │  FP (😅 wrong)   │  TN (✅)                 │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
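The formulas in the table translate directly into code. The sketch below derives accuracy, precision, recall, and F1 from the confusion-matrix counts, and computes KS as max(TPR - FPR) over all score thresholds. The labels, scores, and the 0.5 cut-off are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def classification_report(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def ks_statistic(y_true, scores):
    """max(TPR - FPR) across thresholds; assumes both classes are present."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Evaluate only once all rows sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            best = max(best, tp / pos - fp / neg)
    return best

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

report = classification_report(y_true, y_pred)
ks = ks_statistic(y_true, scores)
print(report, ks)
```

KS comes out here on a 0 to 1 scale. Multiply by 100 to compare it with the 20/40/60 bands in the thresholds table.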

📈 10. REGRESSION METRICS

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  METRIC      │  WHAT IT MEASURES        │  USE WHEN                │
│  ────────────┼──────────────────────────┼──────────                │
│  MAE         │  Avg absolute error      │  Interpretable             │
│              │  Same unit as target     │  Outliers present        │
│  MSE         │  Avg squared error       │  Used internally           │
│  RMSE        │  √MSE — same unit        │  DEFAULT ⭐                │
│              │  Punishes big errors     │                          │
│  R²          │  Variance explained      │  Business explanation    │
│              │  1=perfect, 0=mean       │                          │
│  MAPE        │  Avg % error             │  Forecasting             │
│              │  Bad when target ≈ 0     │                          │
│                                                                  │
│  WHICH TO USE:                                                   │
│  • Default → RMSE                                                │
│  • Interpretable → MAE                                           │
│  • Business explanation → R²                                     │
│  • Forecasting → MAPE                                            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
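A short sketch shows how the four regression metrics relate to one another on the same predictions. The target and prediction values are made up for illustration, and the function name is an assumption of this example.

```python
import math

def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R², and MAPE for paired values.

    MAPE divides by each true value, so it is undefined when any
    target equals 0 (the "bad when target ≈ 0" caveat in the table).
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}

# Hypothetical dollar amounts: actual vs. predicted.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 360]

report = regression_report(y_true, y_pred)
print({name: round(value, 3) for name, value in report.items()})
```

RMSE comes out larger than MAE on this data because squaring gives the two largest misses (30 and 40) more weight.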

🔍 11. CLUSTERING EVALUATION

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  METRIC         │  ONE-LINER                                     │
│  ───────────────┼────────────────────                            │
│  Silhouette     │  How well a point fits its cluster vs others   │
│                 │  Range: -1 to +1, higher = better              │
│                                                                  │
│  Inertia        │  Sum of squared distances from points to       │
│                 │  their centroids (the RAW NUMBER)              │
│                 │  Lower = tighter clusters                      │
│                                                                  │
│  Elbow Method   │  The TECHNIQUE of plotting Inertia for         │
│                 │  multiple K values and picking the "bend"      │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
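The silhouette definition can be written out directly. For each point, a is its mean distance to the other points in its own cluster, b is its mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The sketch below is a plain-Python illustration with made-up points. Scoring singleton clusters as 0 follows the usual convention, which is also what scikit-learn's `silhouette_score` does.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster
            scores.append(0.0)
            continue
        # a: mean distance to the rest of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the closest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the groups together

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(round(good, 3), round(bad, 3))
```

The well-separated labeling scores close to +1, while the mixed labeling lands near 0, which matches the "higher = better" reading in the table.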

💡 Inertia vs Elbow Method — The Key Difference

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  INERTIA = A SINGLE NUMBER                                       │
│            (one value for one K — e.g., K=5 → inertia=12,345)    │
│                                                                  │
│  ELBOW METHOD = A TECHNIQUE that USES INERTIA                    │
│                 (plot inertia for K=2,3,4...15 → find the bend)  │
│                                                                  │
│  Analogy:                                                        │
│    Inertia      = a thermometer reading (one number)             │
│    Elbow Method = the technique of watching readings over time   │
│                                                                  │
│  USAGE:                                                          │
│    1. Calculate inertia for K=2 to K=15                          │
│    2. Plot inertia vs K (this PLOT = Elbow Method)               │
│    3. Pick K at the "elbow bend" (diminishing returns point)     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
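The three usage steps can be sketched in plain Python. The code below runs a basic 1-D k-means, records inertia for each K, and then picks the bend with a crude heuristic. Everything here is an assumption for illustration: the data, the deterministic centroid initialization, and the "largest second difference" rule for locating the elbow. In practice the bend is usually read off the plot, and K runs only from 2 to 6 because the toy data has nine points.

```python
def kmeans_inertia(values, k, n_iter=50):
    """Run a basic 1-D Lloyd's k-means and return the final inertia.

    Centroids start at evenly spaced positions in the sorted data
    (a deterministic choice for this sketch, not k-means++).
    """
    data = sorted(values)
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min((x - c) ** 2 for c in centroids) for x in data)

def pick_elbow(inertia_by_k):
    """Crude heuristic: the K where the inertia curve bends the most."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_after = inertia_by_k[k] - inertia_by_k[next_k]
        if drop_before - drop_after > best_bend:
            best_k, best_bend = k, drop_before - drop_after
    return best_k

# Hypothetical data with three tight groups around 1, 5, and 9.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

inertias = {k: kmeans_inertia(values, k) for k in range(2, 7)}  # step 1
for k, inertia in inertias.items():                             # step 2
    print(k, round(inertia, 2))
elbow = pick_elbow(inertias)                                    # step 3
print("elbow at K =", elbow)
```

Each entry in `inertias` is a single inertia number, while the loop over K plus the choice of bend is the elbow method.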

🎨 Visual Example

   Inertia
       │●  ← K=2, very high inertia
       │ ●
       │  ●
       │   ●●  ← Big drops (each K adds real value)
       │     ●●
       │        ●●●●●  ← ELBOW! (K=5 or 6 optimal)
       │             ●●●●●●  ← Tiny drops (diminishing returns)
       │                   ●●●●●
       └───────────────────────── K
       2  3  4  5  6  7  8  9 10

🎯 12. THE PRO'S QUICK FACTS

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  💡 KEY INSIGHTS:                                                │
│                                                                  │
│  ✅ XGBoost = strong default for tabular business data           │
│  ✅ CNN = default for image classification (ResNet, EfficientNet)│
│  ✅ Transformers (BERT) = default for NLP                        │
│  ✅ LSTM/Prophet = default for time series                       │
│  ✅ K-Means = default clustering when you know K                 │
│  ✅ DBSCAN = clustering when you don't know K                    │
│  ✅ Isolation Forest = default for anomaly detection             │
│  ✅ PCA = default for dimensionality reduction                   │
│  ✅ LightGBM = faster XGBoost for big data                       │
│  ✅ Hyperopt / Optuna = default for hyperparameter tuning        │
│  ✅ MLflow = default for experiment tracking                     │
│  ✅ SHAP = default for explainability                            │
│  ✅ PSI / CSI = default for production drift monitoring          │
│  ✅ IV + VIF = default for feature selection in credit risk      │
│  ✅ Spark/Databricks = default for big data ML                   │
│                                                                  │
│  ⚠️ COMMON PITFALLS:                                             │
│                                                                  │
│  ❌ Regularized linear models without scaling                    │
│  ❌ K-Means without standardization                              │
│  ❌ Ignoring class imbalance in classification                   │
│  ❌ Using accuracy on imbalanced data                            │
│  ❌ Not validating on out-of-time data                           │
│  ❌ Forgetting to check for data leakage                         │
│  ❌ Trusting feature importance from correlated features         │
│  ❌ Deploying without monitoring (PSI, drift)                    │
│  ❌ Skipping feature selection (IV/VIF) in regulated domains     │
│  ❌ No baseline model before going complex                       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

💡 The Three Rules to Live By

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  RULE 1: Start simple. Beat baseline before going complex.       │
│                                                                  │
│  RULE 2: Trust evaluation, not algorithm hype.                   │
│          A simple model with good evaluation beats a fancy       │
│          model with poor validation.                             │
│                                                                  │
│  RULE 3: Production starts at modeling, not after.               │
│          Think monitoring, drift, fairness from day 1.           │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
