Saturday, June 20, 2026

📘 XGBoost: The Complete Interview Cheat Sheet

 🎯 What Is XGBoost?

XGBoost = eXtreme Gradient Boosting

It's a machine learning algorithm that builds many small decision trees one after another, where each new tree learns from the mistakes of the previous ones.

The Doctor Analogy 🏥

AlgorithmAnalogy
Single Decision TreeOne doctor diagnoses you
Random Forest100 doctors diagnose independently → vote
XGBoostA chain of doctors where each new doctor focuses on what previous doctors got WRONG

🌳 The Big Picture: How XGBoost Builds Trees

Code
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  Initial Guess (0.5)                                           │
│         │                                                      │
│         ▼                                                      │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│   │ Tree 1  │ →  │ Tree 2  │ →  │ Tree 3  │ →  │ Tree N  │    │
│   │ learns  │    │ fixes   │    │ fixes   │    │ fixes   │    │
│   │ from y  │    │ Tree 1's│    │ Tree 2's│    │ leftover│    │
│   │         │    │ errors  │    │ errors  │    │ errors  │    │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘    │
│         │              │              │              │         │
│         └──────────────┴──────┬───────┴──────────────┘         │
│                               ▼                                │
│                  Sum all outputs × learning rate               │
│                               │                                │
│                               ▼                                │
│                         Apply Sigmoid                          │
│                               │                                │
│                               ▼                                │
│                     Probability (0 to 1)                       │
└────────────────────────────────────────────────────────────────┘

🔢 Step-by-Step: Binary Classification Example

Dataset: Predict if student passes exam (0 or 1)

StudentHoursSleepActual
A280
B460
C671
D851

Step 1: Start with Base Prediction

Code
base_prob = 0.5  →  base_log_odds = log(0.5/0.5) = 0

Every student starts at probability = 0.5.

Step 2: Calculate Residuals (Mistakes)

Code
Residual = Actual − Predicted Probability
StudentActualPredictedResidual
A00.5−0.5
B00.5−0.5
C10.5+0.5
D10.5+0.5

Step 3: Build Tree #1 to Predict Residuals

The tree tries different splits and picks the one with maximum Gain:

Similarity=(Residuals)2[p(1p)]+λ

$$\text{Gain} = \text{Sim}{Left} + \text{Sim}{Right} - \text{Sim}_{Root}$$

Best split: Hours > 5?

Code
        Hours > 5?
        /        \
       No         Yes
      [A,B]     [C,D]
    Output:-2  Output:+2

Step 4: Update Predictions (Learning Rate η = 0.3)

Code
New log-odds = Old log-odds + η × Tree Output
StudentOldTreeNew log-oddsNew Prob
A0−2−0.60.354
B0−2−0.60.354
C0+2+0.60.646
D0+2+0.60.646

✅ A & B moved toward 0, C & D moved toward 1

Step 5: Repeat for Tree #2, #3, ... #N

Each tree learns from the new (smaller) residuals.

Step 6: Final Probability

P(y=1)=σ(base+ηi=1NTreei)

σ(z)=11+ez


🌲 How Each Tree Is Built (Tree Internals)

Q1: Is Each Tree Just One Split?

NO! Each tree has multiple depths (default max_depth = 6 → up to 64 leaves).

Code
max_depth=1 (stump):       max_depth=3 (typical):
                           
   Hours > 5?              Hours > 5?
   /       \              /         \
  No       Yes        Sleep>6?    Sleep>7?
                       / \         / \
                     ... ...     ... ...

Q2: How Does It Pick Splits?

Greedy Exhaustive Search at every node:

  1. Try every feature
  2. Try every split point for each feature
  3. Pick the split with highest Gain
  4. Repeat for child nodes until max_depth or gain < γ (prune)

Q3: Is Column Selection Random?

Default: NO — uses ALL columns at every node.

But you can enable randomness:

ParameterWhat It Does
colsample_bytreeRandom columns per tree
colsample_bylevelRandom columns per depth level
colsample_bynodeRandom columns per split

🎲 Getting Probabilities: The Full Math

Code
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Input: Charlie (Hours=6, Sleep=7)                      │
│                                                         │
│  Step 1: Base log-odds = 0                              │
│  Step 2: Tree 1 output = +2.0                           │
│  Step 3: Tree 2 output = +1.5                           │
│  Step 4: Tree 3 output = +0.8                           │
│                                                         │
│  Final log-odds = 0 + 0.3 × (2.0 + 1.5 + 0.8) = 1.29   │
│                                                         │
│  Probability = 1 / (1 + e^(-1.29)) = 0.784              │
│                                                         │
│  Class = 1 (since 0.784 ≥ 0.5) ✅                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

Why log-odds internally?

  • Probabilities are bounded (0–1) — adding tree outputs could exceed limits
  • Log-odds are unbounded — safe to add
  • Sigmoid converts back to probability at the end

Training vs Prediction Parallelism

TrainingPrediction
Across trees❌ Sequential (must be)Fully parallel
Within a tree✅ Parallel features✅ Parallel traversal
Across rows✅ Yes✅ Yes


No comments:

Post a Comment