⚖️ Handling Imbalanced Classes (1% Event Rate)
The Problem
- 9,900 negatives + 100 positives
- A model that always predicts 0 gets 99% accuracy (useless!)
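To make those numbers concrete, here is a minimal sketch in plain Python. The labels and the always-0 "model" are illustrative assumptions, not a real dataset. It shows that 99% accuracy can coexist with zero recall:

```python
# Illustrative labels: 9,900 negatives and 100 positives (1% event rate).
y_true = [0] * 9_900 + [1] * 100

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all records predicted correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that the model catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2%}, recall={recall:.2%}")
```

Accuracy looks excellent while not a single positive is found, which is why accuracy alone is a poor guide here.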
XGBoost's Three Defenses
🛡️ Defense 1: scale_pos_weight
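Setting scale_pos_weight to 99 (negatives ÷ positives = 9,900 ÷ 100) multiplies the gradients of the positive class by 99, so the 100 positives carry as much total weight in training as the 9,900 negatives.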
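A minimal sketch of how that setting might be computed, assuming the common negatives-to-positives heuristic. The parameter names `objective` and `scale_pos_weight` are real XGBoost parameters, but the dict is only built here, not passed to the library:

```python
n_neg, n_pos = 9_900, 100

# Common heuristic: ratio of negative to positive examples.
scale_pos_weight = n_neg / n_pos

# Parameter dict in the form XGBoost accepts (for example via
# xgboost.train or XGBClassifier(**params)); xgboost is not imported here.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight,
}

# Total weight carried by each class once positives are up-weighted.
total_pos_weight = n_pos * scale_pos_weight
total_neg_weight = n_neg * 1.0

print(params)
print(total_pos_weight, total_neg_weight)
```

The ratio is a starting point rather than a rule; it is usually worth tuning alongside the other hyperparameters.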
🛡️ Defense 2: The Boosting Mechanism Itself
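One way to read this defense, sketched under the assumption of the standard logistic loss, where each record's gradient is its predicted probability minus its label. A model that scores everything near 0 leaves every positive with a large residual, so the next tree is pulled toward them. The starting probability of 0.01 is an illustrative assumption:

```python
# Assumed setup: logistic loss, every record currently predicted at p = 0.01.
p = 0.01
n_neg, n_pos = 9_900, 100

# Gradient of logistic loss with respect to the raw score: p - y.
grad_pos = p - 1  # each missed positive: large negative gradient
grad_neg = p - 0  # each negative: small positive gradient

# Per record, a missed positive pulls far harder than a negative.
per_record_ratio = abs(grad_pos) / abs(grad_neg)

# In aggregate, the 100 positives pull about as hard as the 9,900 negatives.
total_pos_pull = n_pos * abs(grad_pos)
total_neg_pull = n_neg * abs(grad_neg)

print(per_record_ratio, total_pos_pull, total_neg_pull)
```

Each later tree is fit to these gradients, so records the ensemble still gets wrong keep drawing attention until they are handled.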
🛡️ Defense 3: Imbalance-Friendly Metric
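A minimal sketch, assuming the metric meant here is area under the precision-recall curve (`aucpr` is one of XGBoost's built-in `eval_metric` values). The hand-rolled `average_precision` helper and the toy scores are illustrative assumptions; they show how a ranking-based metric separates a model that finds the positives from one that does not:

```python
def average_precision(y_true, scores):
    """Mean of the precision values at each rank where a positive appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Illustrative data: 2 positives among 10 records.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
good_scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2]   # positives ranked first
poor_scores = [0.1, 0.2, 0.3, 0.15, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # positives ranked last

ap_good = average_precision(y_true, good_scores)
ap_poor = average_precision(y_true, poor_scores)
print(ap_good, ap_poor)

# Assumed XGBoost setting: track PR AUC during training instead of error rate.
params = {"objective": "binary:logistic", "eval_metric": "aucpr"}
```

Unlike accuracy, this score only improves when the rare positives move up the ranking.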
🧠 The Key Insight: How "Aggregate Math" Still Catches the 1%
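Even though each split aggregates ALL records, the Gain formula squares the sum of gradients in each child (the G² / (H + λ) terms), and that rewards CONCENTRATION. When positives cluster in one child, their gradients add up instead of cancelling against the negatives, so splits that isolate the minority class produce disproportionately high Gain.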
The math automatically finds splits where rare classes cluster together.
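The effect can be checked with a small worked example. This is a sketch, assuming logistic loss (gradient p - y, Hessian p(1 - p)), a regularization term λ = 1, and the complexity penalty γ left out. The node sizes and the helper names `split_gain` and `grad_hess_sums` are illustrative:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)], gamma omitted."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)

def grad_hess_sums(n_pos, n_neg, p=0.01):
    """Summed logistic-loss gradients (p - y) and Hessians p(1 - p) for a node."""
    g = n_pos * (p - 1) + n_neg * p
    h = (n_pos + n_neg) * p * (1 - p)
    return g, h

# Illustrative node: 1,000 records, 10 positives, all predicted at p = 0.01.
# Split A isolates most positives in a small left child (8 of 20 records).
gain_a = split_gain(*grad_hess_sums(8, 12), *grad_hess_sums(2, 978))
# Split B spreads positives evenly, so each child mirrors the parent.
gain_b = split_gain(*grad_hess_sums(5, 495), *grad_hess_sums(5, 495))

print(f"isolating split gain: {gain_a:.2f}")
print(f"even split gain:      {gain_b:.2f}")
```

In the even split the positive and negative gradients cancel inside each child, so the squared sums stay near zero; in the isolating split they do not cancel, and the Gain is large.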
🕳️ Handling Missing Values (Nulls) — XGBoost's Magic
The Sparsity-Aware Algorithm
At every split, XGBoost asks 3 questions:
- What's the Gain if missing values go LEFT?
- What's the Gain if missing values go RIGHT?
- Which wins? → That becomes the default direction for this split
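The three questions above can be sketched as follows. This is a simplified illustration, not XGBoost's actual implementation: it assumes the gradient and Hessian sums for each group are already known, uses λ = 1, and the function names are hypothetical:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split from summed gradients and Hessians (gamma omitted)."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent)

def choose_default_direction(left, right, missing, lam=1.0):
    """Each argument is a (gradient_sum, hessian_sum) pair.

    `left` and `right` cover records with a known feature value;
    `missing` covers records whose value is NaN.
    """
    g_l, h_l = left
    g_r, h_r = right
    g_m, h_m = missing
    # Question 1: Gain if the missing records join the left child.
    gain_if_left = split_gain(g_l + g_m, h_l + h_m, g_r, h_r, lam)
    # Question 2: Gain if the missing records join the right child.
    gain_if_right = split_gain(g_l, h_l, g_r + g_m, h_r + h_m, lam)
    # Question 3: the better option becomes this split's default direction.
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right

# Illustrative sums: the missing group's gradients resemble the left child's.
direction, gain = choose_default_direction(
    left=(-6.0, 1.0), right=(5.0, 4.0), missing=(-3.0, 0.5)
)
print(direction, round(gain, 3))
```

Because the missing records behave like the left child here, sending them left keeps the gradients concentrated and wins the comparison.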
Why This Is Genius
- Missing = Information (often correlates with something meaningful)
- No manual imputation needed — just pass NaN directly
- Different defaults per split — adapts to context
- Only 2 extra calculations per split — almost free
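At prediction time, the learned default simply decides where a NaN goes. Here is a minimal sketch of that routing; `SplitNode` and `route` are hypothetical stand-ins written for illustration, not XGBoost's internal types, and the feature name and thresholds are made up:

```python
import math
from dataclasses import dataclass

@dataclass
class SplitNode:
    """Hypothetical stand-in for one tree split."""
    feature: str
    threshold: float
    default_left: bool  # direction learned for missing values at this split

def route(node: SplitNode, row: dict) -> str:
    """Return the child a row goes to; NaN or absent values follow the default."""
    value = row.get(node.feature, math.nan)
    if value is None or math.isnan(value):
        return "left" if node.default_left else "right"
    return "left" if value < node.threshold else "right"

# Two splits on the same feature can learn different defaults.
split_a = SplitNode("income", 50_000.0, default_left=True)
split_b = SplitNode("income", 120_000.0, default_left=False)

row_missing = {"income": math.nan}
row_known = {"income": 60_000.0}

print(route(split_a, row_missing), route(split_b, row_missing))
print(route(split_a, row_known), route(split_b, row_known))
```

The same missing value goes left at one split and right at another, which is what "different defaults per split" means in practice.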