🎯 What Is XGBoost?
XGBoost = eXtreme Gradient Boosting
It's a machine learning algorithm that builds many small decision trees one after another, where each new tree learns from the mistakes of the previous ones.
The Doctor Analogy 🏥
| Algorithm | Analogy |
|---|---|
| Single Decision Tree | One doctor diagnoses you |
| Random Forest | 100 doctors diagnose independently → vote |
| XGBoost | A chain of doctors where each new doctor focuses on what previous doctors got WRONG |
🌳 The Big Picture: How XGBoost Builds Trees
🔢 Step-by-Step: Binary Classification Example
Dataset: Predict if student passes exam (0 or 1)
| Student | Hours | Sleep | Actual |
|---|---|---|---|
| A | 2 | 8 | 0 |
| B | 4 | 6 | 0 |
| C | 6 | 7 | 1 |
| D | 8 | 5 | 1 |
Step 1: Start with Base Prediction
Every student starts at probability = 0.5.
Step 2: Calculate Residuals (Mistakes)
| Student | Actual | Predicted | Residual |
|---|---|---|---|
| A | 0 | 0.5 | −0.5 |
| B | 0 | 0.5 | −0.5 |
| C | 1 | 0.5 | +0.5 |
| D | 1 | 0.5 | +0.5 |
Step 3: Build Tree #1 to Predict Residuals
The tree tries different splits and picks the one with maximum Gain:
$$\text{Gain} = \text{Sim}{Left} + \text{Sim}{Right} - \text{Sim}_{Root}$$
Best split: Hours > 5?
Step 4: Update Predictions (Learning Rate η = 0.3)
| Student | Old | Tree | New log-odds | New Prob |
|---|---|---|---|---|
| A | 0 | −2 | −0.6 | 0.354 |
| B | 0 | −2 | −0.6 | 0.354 |
| C | 0 | +2 | +0.6 | 0.646 |
| D | 0 | +2 | +0.6 | 0.646 |
✅ A & B moved toward 0, C & D moved toward 1
Step 5: Repeat for Tree #2, #3, ... #N
Each tree learns from the new (smaller) residuals.
Step 6: Final Probability
🌲 How Each Tree Is Built (Tree Internals)
Q1: Is Each Tree Just One Split?
NO! Each tree has multiple depths (default max_depth = 6 → up to 64 leaves).
Q2: How Does It Pick Splits?
Greedy Exhaustive Search at every node:
- Try every feature
- Try every split point for each feature
- Pick the split with highest Gain
- Repeat for child nodes until
max_depthorgain < γ(prune)
Q3: Is Column Selection Random?
Default: NO — uses ALL columns at every node.
But you can enable randomness:
| Parameter | What It Does |
|---|---|
colsample_bytree | Random columns per tree |
colsample_bylevel | Random columns per depth level |
colsample_bynode | Random columns per split |
🎲 Getting Probabilities: The Full Math
Why log-odds internally?
- Probabilities are bounded (0–1) — adding tree outputs could exceed limits
- Log-odds are unbounded — safe to add
- Sigmoid converts back to probability at the end
Training vs Prediction Parallelism
| Training | Prediction | |
|---|---|---|
| Across trees | ❌ Sequential (must be) | ✅ Fully parallel |
| Within a tree | ✅ Parallel features | ✅ Parallel traversal |
| Across rows | ✅ Yes | ✅ Yes |
No comments:
Post a Comment