Feature Selection Techniques for Regression & Classification
Feature selection techniques can be grouped into three main stages:
- Filter Methods (Before Modeling) – purely statistical or rule-based, no model needed.
- Embedded Methods (During Modeling) – selection happens while training.
- Wrapper Methods (After Modeling) – iterative, model-based evaluation.
1. Filter Methods — Pre-Model Selection
These rely on statistical tests and relationships between features and target.
They are fast, model-agnostic, and help remove irrelevant or redundant features early.
| Technique | Works For | Purpose / Usage | Advantages | Disadvantages | 
|---|---|---|---|---|
| Correlation Analysis | Regression & Classification | Measures linear relationship between features and target (e.g., Pearson, Spearman). | Simple, quick redundancy detection. | Only captures linear relationships, ignores interactions. | 
| VIF (Variance Inflation Factor) | Regression | Detects multicollinearity in predictors to improve regression stability. | Identifies redundant predictors. | Only applies to linear regression; needs numerical/dummy-encoded data. | 
| IV (Information Value) | Classification (binary) | Quantifies a variable’s ability to separate two classes. | Interpretable, great for credit scoring. | Binary classification only; needs binning for continuous data. | 
| Chi-Square Test | Classification | Tests statistical dependence between categorical features and target. | Works well with categorical data. | Requires categorical variables; not for continuous targets. | 
| ANOVA F-test | Regression & Classification | Tests if means of numerical feature differ significantly across target groups. | Good for numerical vs categorical target relationship. | Assumes normally distributed data; no interactions. | 
| Mutual Information | Regression & Classification | Measures general dependency (linear + nonlinear) between feature and target. | Captures non-linear relationships. | Computationally heavier than correlation. | 
| Bivariate Analysis | Regression & Classification | Compares target statistics (mean, proportion) across feature bins/categories. | Easy visual interpretation. | Summarizes only one feature at a time; no interactions. | 
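A minimal sketch of a few of the filter checks above, using pandas, scikit-learn, and statsmodels. It assumes a DataFrame `df` with numeric features and a class label in a column named `target`; both names are placeholders, not part of any fixed API.

```python
# Minimal filter-method sketch; `df` and the "target" column name are assumptions.
import pandas as pd
from sklearn.feature_selection import f_classif, mutual_info_classif
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df.drop(columns=["target"])
y = df["target"]

# Correlation analysis: Pearson correlation of each numeric feature with the target.
pearson_corr = X.corrwith(y)

# VIF: multicollinearity among predictors (values above ~5-10 usually flag redundancy).
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)

# Mutual information: captures linear and non-linear dependence on a class target.
mi = pd.Series(mutual_info_classif(X, y), index=X.columns)

# ANOVA F-test: do the feature means differ significantly across target classes?
f_scores, p_values = f_classif(X, y)
```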
2. Embedded Methods — Model-Integrated Selection
Feature selection happens during model training, influenced by the learning algorithm.
Best for keeping predictive and interpretable features.
| Technique | Works For | Purpose / Usage | Advantages | Disadvantages | 
|---|---|---|---|---|
| L1 Regularization (Lasso) | Regression & Classification | Shrinks less important feature coefficients to zero, effectively removing them. | Produces sparse, interpretable models. | May drop correlated useful features. | 
| Elastic Net | Regression & Classification | Combines L1 (sparse) and L2 (stability) penalties for balanced selection. | Handles correlated features better than Lasso. | Needs hyperparameter tuning. | 
| Tree-based Feature Importance | Regression & Classification | Measures how much each feature reduces impurity (Gini, MSE) in tree splits. | Works for non-linear, interaction-rich data. | Can be biased towards high-cardinality features. | 
| SHAP Values | Regression & Classification | Model-agnostic method attributing contributions of features to predictions. | Explains both global and per-instance effects. | Computationally expensive for large datasets. | 
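A minimal sketch of two embedded approaches with scikit-learn, assuming numeric arrays `X` and `y` for a classification task (placeholder names). The L1-penalized logistic regression stands in for Lasso on the classification side; for regression, scikit-learn's `Lasso` plays the same role.

```python
# Minimal embedded-selection sketch; X and y are assumed numeric feature/target arrays.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# L1 (Lasso-style) penalty: coefficients shrunk exactly to zero drop their features.
X_scaled = StandardScaler().fit_transform(X)
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X_scaled, y)
kept_mask = l1_selector.get_support()

# Tree-based importance: total impurity reduction attributed to each feature.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_
```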
3. Wrapper Methods — Iterative Search
These methods repeatedly train and test models with different subsets of features to find the best combination.
| Technique | Works For | Purpose / Usage | Advantages | Disadvantages | 
|---|---|---|---|---|
| Recursive Feature Elimination (RFE) | Regression & Classification | Iteratively removes least important features until desired number is reached. | Finds optimal subset for model performance. | Slow for large datasets; requires multiple model fits. | 
| Sequential Feature Selection | Regression & Classification | Adds or removes features step-by-step based on model performance. | Simple and interpretable process. | Computationally expensive. | 
| Permutation Importance | Regression & Classification | Measures drop in model performance when a feature’s values are shuffled. | Works for any model; easy to interpret. | Requires trained model; may be unstable with correlated features. | 
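A minimal sketch of the three wrapper-style techniques with scikit-learn, again assuming arrays `X` and `y`; the logistic regression estimator and the choice of 10 features are illustrative assumptions, not recommendations.

```python
# Minimal wrapper-method sketch; X, y, the estimator, and n_features are assumptions.
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# RFE: fit, drop the weakest features, and refit until 10 features remain.
rfe = RFE(estimator, n_features_to_select=10).fit(X_train, y_train)

# Sequential forward selection: greedily add the feature that most improves CV score.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=10, direction="forward"
).fit(X_train, y_train)

# Permutation importance: performance drop when one feature's values are shuffled.
fitted = estimator.fit(X_train, y_train)
perm = permutation_importance(fitted, X_test, y_test, n_repeats=10, random_state=0)
```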
4. Specialized / Dimensionality Reduction & Domain Knowledge
These are niche but powerful, especially for high-dimensional data.
| Technique | Works For | Purpose / Usage | Advantages | Disadvantages | 
|---|---|---|---|---|
| Principal Component Analysis (PCA) | Regression & Classification | Transforms correlated features into uncorrelated components. | Reduces dimensionality while keeping variance. | Components lose original feature meaning. | 
| Domain Knowledge Filtering | Regression & Classification | Remove irrelevant features based on business or scientific understanding. | Improves interpretability and model relevance. | Relies on expert input; risk of bias. | 
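A minimal sketch of PCA-based reduction, assuming a numeric feature matrix `X`; the 0.95 variance threshold is an illustrative choice.

```python
# Minimal PCA sketch; X is an assumed numeric feature matrix, 0.95 is illustrative.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                   # keep enough components for 95% variance
X_components = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_      # variance share per retained component
```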
Practical Usage Flow
- Initial Filtering → Correlation, IV, VIF, Chi-Square, Mutual Information, ANOVA.
- Model-Integrated Refinement → Lasso, Elastic Net, Tree-based Importance, SHAP.
- Performance Optimization → RFE, Sequential Selection, Permutation Importance.
- Special Cases → PCA for dimensionality reduction, domain filtering for expert-driven refinement.
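A minimal sketch wiring the filtering and refinement stages into a single scikit-learn Pipeline; the scoring function, `k=20`, and the regularization strength are illustrative assumptions, and `X`, `y` are placeholder arrays.

```python
# Minimal end-to-end sketch of the flow above; all parameter values are illustrative.
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1 (filter): keep the 20 features with the highest mutual information.
    ("filter", SelectKBest(mutual_info_classif, k=20)),
    # Stage 2 (embedded): L1-penalized model drops features with zero coefficients.
    ("embed", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    # Final estimator trained on the surviving features.
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
```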
 