Feature Selection Techniques for Regression & Classification
Feature selection techniques can be grouped into three main stages (plus a handful of specialized techniques covered at the end):
- Filter Methods (Before Modeling) – purely statistical or rule-based, no model needed.
- Embedded Methods (During Modeling) – selection happens while training.
- Wrapper Methods (After Modeling) – iterative, model-based evaluation.
1. Filter Methods — Pre-Model Selection
These rely on statistical tests and relationships between features and target.
They are fast, model-agnostic, and help remove irrelevant or redundant features early.
Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
---|---|---|---|---|
Correlation Analysis | Regression & Classification | Measures the linear (Pearson) or monotonic (Spearman) relationship between a feature and the target. | Simple, quick redundancy detection. | Misses non-monotonic relationships and feature interactions. |
VIF (Variance Inflation Factor) | Regression | Detects multicollinearity among predictors to improve regression stability. | Identifies redundant predictors. | Mainly meaningful for linear models; needs numerical/dummy-encoded data. |
IV (Information Value) | Classification (binary) | Quantifies a variable’s ability to separate two classes. | Interpretable, great for credit scoring. | Binary classification only; needs binning for continuous data. |
Chi-Square Test | Classification | Tests statistical dependence between categorical features and target. | Works well with categorical data. | Requires categorical variables; not for continuous targets. |
ANOVA F-test | Regression & Classification | Tests whether the mean of a numerical feature differs significantly across target classes; an analogous F-test on a univariate linear fit handles continuous targets. | Good for numerical-feature vs categorical-target relationships. | Assumes normally distributed data; no interactions. |
Mutual Information | Regression & Classification | Measures general dependency (linear + nonlinear) between feature and target. | Captures non-linear relationships. | Computationally heavier than correlation. |
Bivariate Analysis | Regression & Classification | Compares target statistics (mean, proportion) across feature bins/categories. | Easy visual interpretation. | Summarizes only one feature at a time; no interactions. |
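As a rough illustration of the first two filter techniques, the sketch below computes feature-to-target correlations and VIF on a small synthetic dataset; the column names, sample size, and the ~5-10 VIF rule of thumb are illustrative assumptions, and it relies on pandas, NumPy, and statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data with a deliberately redundant predictor (hypothetical columns).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(50, 10, 500),
    "age": rng.normal(40, 8, 500),
})
X["income_dup"] = 0.9 * X["income"] + rng.normal(0, 1, 500)  # near-duplicate of income
y = 2 * X["income"] + 0.5 * X["age"] + rng.normal(0, 5, 500)

# Correlation of each feature with the target: Pearson (linear), Spearman (monotonic).
print(X.corrwith(y, method="pearson"))
print(X.corrwith(y, method="spearman"))

# VIF per predictor (constant added first); values above ~5-10 usually flag multicollinearity.
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)
```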
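scikit-learn's SelectKBest wraps the ANOVA F-test, mutual information, and chi-square scores behind one interface. The sketch below uses the built-in breast-cancer dataset purely as a stand-in, and k=10 is arbitrary; min-max scaling before the chi-square test is a pragmatic workaround, since that test expects non-negative (ideally count-like) features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# ANOVA F-test: numerical features vs. a categorical target.
anova = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("ANOVA:", X.columns[anova.get_support()].tolist())

# Mutual information: also captures non-linear dependence.
mi = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print("MI:   ", X.columns[mi.get_support()].tolist())

# Chi-square needs non-negative inputs, so rescale to [0, 1] first.
X_pos = MinMaxScaler().fit_transform(X)
chi = SelectKBest(score_func=chi2, k=10).fit(X_pos, y)
print("Chi2: ", X.columns[chi.get_support()].tolist())
```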
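Information Value has no standard scikit-learn helper, so the snippet below is a hand-rolled sketch using the usual WoE/IV definition; the bin count, the 1e-6 floor, and the synthetic data are all assumptions. A common rule of thumb reads IV above roughly 0.3 as a strong predictor.

```python
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, target: pd.Series, bins: int = 10) -> float:
    """Rough IV for a binary target; continuous features are quantile-binned first."""
    df = pd.DataFrame({"x": pd.qcut(feature, q=bins, duplicates="drop"), "y": target})
    grouped = df.groupby("x", observed=True)["y"].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    # Share of events (y=1) and non-events (y=0) per bin, floored to avoid log(0).
    pct_event = (grouped["events"] / grouped["events"].sum()).clip(lower=1e-6)
    pct_non = (grouped["non_events"] / grouped["non_events"].sum()).clip(lower=1e-6)
    woe = np.log(pct_event / pct_non)  # weight of evidence per bin
    return float(((pct_event - pct_non) * woe).sum())

# Hypothetical usage on a synthetic feature/target pair.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = pd.Series((rng.random(2000) < 1 / (1 + np.exp(-2 * x))).astype(int))
print(round(information_value(pd.Series(x), y), 3))
```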
2. Embedded Methods — Model-Integrated Selection
Feature selection happens during model training, influenced by the learning algorithm.
Best for keeping predictive and interpretable features.
Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
---|---|---|---|---|
L1 Regularization (Lasso) | Regression & Classification | Shrinks less important feature coefficients to zero, effectively removing them. | Produces sparse, interpretable models. | May drop correlated useful features. |
Elastic Net | Regression & Classification | Combines L1 (sparse) and L2 (stability) penalties for balanced selection. | Handles correlated features better than Lasso. | Needs hyperparameter tuning. |
Tree-based Feature Importance | Regression & Classification | Measures how much each feature reduces impurity (Gini, MSE) in tree splits. | Works for non-linear, interaction-rich data. | Can be biased towards high-cardinality features. |
SHAP Values | Regression & Classification | Model-agnostic method attributing contributions of features to predictions. | Explains both global and per-instance effects. | Computationally expensive for large datasets. |
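A minimal sketch of Lasso and Elastic Net selection, assuming scikit-learn and a synthetic regression problem where only 5 of 30 features are informative; the cross-validation settings and l1_ratio grid are illustrative defaults rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 carry signal.
X, y = make_regression(n_samples=500, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

# Lasso (L1): coefficients shrunk exactly to zero drop out of the model.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print("Lasso keeps:", np.flatnonzero(lasso.coef_))

# Elastic Net (L1 + L2): l1_ratio trades sparsity against stability with correlated features.
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5, random_state=0).fit(X, y)
print("Elastic Net keeps:", np.flatnonzero(enet.coef_))
```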
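Tree-based importances come directly from a fitted model, while SHAP values need the separate `shap` package; the sketch below uses a gradient-boosted classifier on the same stand-in dataset, with arbitrary model settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
import shap  # third-party package: pip install shap

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances stored on the fitted model.
impurity = pd.Series(model.feature_importances_, index=X.columns)
print(impurity.sort_values(ascending=False).head(10))

# SHAP: per-instance contributions; mean |SHAP| gives a global ranking.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # (n_samples, n_features) for a binary GBM
mean_abs = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(mean_abs.sort_values(ascending=False).head(10))
```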
3. Wrapper Methods — Iterative Search
These methods repeatedly train and test models with different subsets of features to find the best combination.
Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
---|---|---|---|---|
Recursive Feature Elimination (RFE) | Regression & Classification | Iteratively removes least important features until desired number is reached. | Finds optimal subset for model performance. | Slow for large datasets; requires multiple model fits. |
Sequential Feature Selection | Regression & Classification | Adds or removes features step-by-step based on model performance. | Simple and interpretable process. | Computationally expensive. |
Permutation Importance | Regression & Classification | Measures drop in model performance when a feature’s values are shuffled. | Works for any model; easy to interpret. | Requires trained model; may be unstable with correlated features. |
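Both RFE and sequential selection are available in scikit-learn; the sketch below wraps a logistic regression (features standardized first so the solver converges), and keeping 10 features is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
estimator = LogisticRegression(max_iter=1000)

# RFE: repeatedly drop the least important feature until 10 remain.
rfe = RFE(estimator, n_features_to_select=10).fit(X_scaled, y)
print("RFE:", X.columns[rfe.support_].tolist())

# Forward sequential selection: add one feature at a time by cross-validated score.
sfs = SequentialFeatureSelector(estimator, n_features_to_select=10,
                                direction="forward", cv=3).fit(X_scaled, y)
print("SFS:", X.columns[sfs.get_support()].tolist())
```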
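Permutation importance needs an already fitted model and, ideally, held-out data; the sketch below uses a random forest as a stand-in model and 10 shuffles per feature.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the drop in accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
drops = pd.Series(result.importances_mean, index=X.columns)
print(drops.sort_values(ascending=False).head(10))
```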
4. Specialized / Dimensionality Reduction & Domain Knowledge
These are niche but powerful, especially for high-dimensional data.
Technique | Works For | Purpose / Usage | Advantages | Disadvantages |
---|---|---|---|---|
Principal Component Analysis (PCA) | Regression & Classification | Transforms correlated features into uncorrelated components. | Reduces dimensionality while keeping variance. | Components lose original feature meaning. |
Domain Knowledge Filtering | Regression & Classification | Removes irrelevant features based on business or scientific understanding. | Improves interpretability and model relevance. | Relies on expert input; risk of bias. |
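A short PCA sketch, assuming standardized inputs and a 95% explained-variance target; both choices are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

Note that the resulting components no longer map to individual original features, which is exactly the interpretability trade-off flagged in the table above.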
Practical Usage Flow
- Initial Filtering → Correlation, IV, VIF, Chi-Square, Mutual Information, ANOVA.
- Model-Integrated Refinement → Lasso, Elastic Net, Tree-based Importance, SHAP.
- Performance Optimization → RFE, Sequential Selection, Permutation Importance.
- Special Cases → PCA for dimensionality reduction, domain filtering for expert-driven refinement.
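To make this flow concrete, the sketch below chains a filter step (ANOVA F-test) with an embedded step (L1-penalized logistic regression) inside a single scikit-learn Pipeline; the dataset, k, and C values are placeholders rather than tuned choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Stage 1 (filter): keep the 20 features with the highest ANOVA F-scores.
    ("filter", SelectKBest(f_classif, k=20)),
    # Stage 2 (embedded): drop features whose L1 coefficients shrink to zero.
    ("embedded", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5))),
    ("model", LogisticRegression(max_iter=2000)),
])
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean().round(3))
```

Running the selection steps inside the Pipeline keeps them within each cross-validation fold, which avoids leaking information from the validation data into the feature choice.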